RE: Starts With x and Ends With x Queries
I would say that matching root words in German compounds is a text analysis application.

Herb...

-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 08, 2005 11:08 AM
To: Lucene Users List
Subject: Re: Starts With x and Ends With x Queries

That might be true ... but our application is not a text analysis application, and it is also not intended to be a search engine. We use Lucene just to index our pages.

- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Starts With x and Ends With x Queries
Erik Hatcher wrote:

> On Feb 8, 2005, at 10:37 AM, sergiu gordea wrote:
>
>> Hi Erik,
>>
>>> I'm not changing any functionality. WildcardQuery will still support leading wildcard characters, QueryParser will still disallow them. All I'm going to change is the javadoc that makes it sound like WildcardQuery does not support leading wildcard characters.
>>>
>>> Erik
>>
>> From what I have read on the mailing list, there are more Lucene users who would like to be able to construct suffix queries. They are very useful for the German language, because it has many long compound words created by concatenating simpler words. This is one of the requirements of our system, so I had to patch Lucene to make QueryParser allow suffix queries. Now I need to update the Lucene library to the latest version, and I will have to patch it again.
>>
>> Do you think it will be possible in the future to have a field in QueryParser, boolean ALLOW_SUFFIX_QUERIES?
>
> I have no objections to that type of switch. Please submit a patch to QueryParser.jj that implements this as an option, with the default to disallow suffix queries, along with a test case, and I'd be happy to apply it.
>
> Erik

I'm pleased to hear that. I'm not very skilled at writing .jj files, but I will try to do it in the next few days.

Sergiu
Re: Starts With x and Ends With x Queries
Chong, Herb wrote:

> Commercial text analytics tools, including search engines, usually split compound words when tokenizing German.
>
> Herb

That might be true ... but our application is not a text analysis application, and it is also not intended to be a search engine. We use Lucene just to index our pages.

Best,
Sergiu

> -Original Message-
> From: sergiu gordea [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 08, 2005 10:38 AM
> To: Lucene Users List
> Subject: Re: Starts With x and Ends With x Queries
>
> From what I have read on the mailing list, there are more Lucene users who would like to be able to construct suffix queries. They are very useful for the German language, because it has many long compound words created by concatenating simpler words. This is one of the requirements of our system, so I had to patch Lucene to make QueryParser allow suffix queries.
RE: Starts With x and Ends With x Queries
Commercial text analytics tools, including search engines, usually split compound words when tokenizing German.

Herb

-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 08, 2005 10:38 AM
To: Lucene Users List
Subject: Re: Starts With x and Ends With x Queries

From what I have read on the mailing list, there are more Lucene users who would like to be able to construct suffix queries. They are very useful for the German language, because it has many long compound words created by concatenating simpler words. This is one of the requirements of our system, so I had to patch Lucene to make QueryParser allow suffix queries.
Re: Starts With x and Ends With x Queries
On Feb 8, 2005, at 10:37 AM, sergiu gordea wrote:

> Hi Erik,
>
>> I'm not changing any functionality. WildcardQuery will still support leading wildcard characters, QueryParser will still disallow them. All I'm going to change is the javadoc that makes it sound like WildcardQuery does not support leading wildcard characters.
>>
>> Erik
>
> From what I have read on the mailing list, there are more Lucene users who would like to be able to construct suffix queries. They are very useful for the German language, because it has many long compound words created by concatenating simpler words. This is one of the requirements of our system, so I had to patch Lucene to make QueryParser allow suffix queries. Now I need to update the Lucene library to the latest version, and I will have to patch it again.
>
> Do you think it will be possible in the future to have a field in QueryParser, boolean ALLOW_SUFFIX_QUERIES?

I have no objections to that type of switch. Please submit a patch to QueryParser.jj that implements this as an option, with the default to disallow suffix queries, along with a test case, and I'd be happy to apply it.

Erik
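The switch discussed in this message can be pictured with a small stand-alone sketch. This is not Lucene's QueryParser and not the patch that was requested; the class name WildcardTermChecker and the method names are hypothetical, chosen only to show an opt-in flag that defaults to rejecting leading wildcards.

```java
// Hypothetical illustration only; this is NOT part of Lucene.
class WildcardTermChecker {
    // Default is to disallow suffix queries, matching the default requested in the thread.
    private boolean allowSuffixQueries = false;

    public void setAllowSuffixQueries(boolean allow) {
        this.allowSuffixQueries = allow;
    }

    // Returns the term unchanged, or rejects it if it starts with a wildcard
    // while the switch is off.
    public String check(String term) {
        boolean leadingWildcard = term.startsWith("*") || term.startsWith("?");
        if (leadingWildcard && !allowSuffixQueries) {
            throw new IllegalArgumentException(
                "Leading wildcard not allowed in term: " + term);
        }
        return term;
    }

    public static void main(String[] args) {
        WildcardTermChecker checker = new WildcardTermChecker();
        System.out.println(checker.check("Smi*"));   // trailing wildcard is always accepted
        checker.setAllowSuffixQueries(true);
        System.out.println(checker.check("*Smith")); // accepted only after opting in
    }
}
```

Keeping the default at "disallow" means existing callers see no change in behaviour, and only applications that accept the performance cost opt in.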
Re: Starts With x and Ends With x Queries
Hi Erik,

> I'm not changing any functionality. WildcardQuery will still support leading wildcard characters, QueryParser will still disallow them. All I'm going to change is the javadoc that makes it sound like WildcardQuery does not support leading wildcard characters.
>
> Erik

From what I have read on the mailing list, there are more Lucene users who would like to be able to construct suffix queries. They are very useful for the German language, because it has many long compound words created by concatenating simpler words. This is one of the requirements of our system, so I had to patch Lucene to make QueryParser allow suffix queries. Now I need to update the Lucene library to the latest version, and I will have to patch it again.

Do you think it will be possible in the future to have a field in QueryParser, boolean ALLOW_SUFFIX_QUERIES?

Thanks for understanding,
Sergiu
Re: Starts With x and Ends With x Queries
I implemented this concept for my "ends with" query. It works very well!

- Original Message -
From: "Chris Hostetter" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Friday, February 04, 2005 9:37 PM
Subject: Re: Starts With x and Ends With x Queries

> Which leads me to my point: if you denormalize your data so that you store
> both the Term you want, and the *reverse* of the term you want, then a
> Suffix query is just a Prefix query on a reversed field -- by sacrificing
> space, you can get all the speed efficiencies of a PrefixQuery when doing
> a SuffixQuery...
>
> [...]
>
> -Hoss
Re: Starts With x and Ends With x Queries
On Feb 7, 2005, at 2:07 AM, sergiu gordea wrote:

> Hi Erik,
>
>> "In order to prevent extremely slow WildcardQueries, a Wildcard term must not start with one of the wildcards * or ?."
>>
>> I don't read that as saying you cannot use an initial wildcard character, but rather that if you use a leading wildcard character you risk performance issues. I'm going to change "must" to "should".
>
> Will this change be available in the next release of Lucene? How do you plan to implement this? Will it be available as an attribute of QueryParser?

I'm not changing any functionality. WildcardQuery will still support leading wildcard characters, QueryParser will still disallow them. All I'm going to change is the javadoc that makes it sound like WildcardQuery does not support leading wildcard characters.

Erik
Re: Starts With x and Ends With x Queries
Hi Erik,

> "In order to prevent extremely slow WildcardQueries, a Wildcard term must not start with one of the wildcards * or ?."
>
> I don't read that as saying you cannot use an initial wildcard character, but rather that if you use a leading wildcard character you risk performance issues. I'm going to change "must" to "should".

Will this change be available in the next release of Lucene? How do you plan to implement this? Will it be available as an attribute of QueryParser?

Best,
Sergiu
Re: Starts With x and Ends With x Queries
: book Managing Gigabytes, making "*string*" queries drastically more
: efficient for searching (though also impacting index size). Take the
: term "cat". It would be indexed with all rotated variations with an
: end of word marker added:
...
: The query for "*at*" would be preprocessed and rotated such that the
: wildcards are collapsed at the end to search for "at*" as a
: PrefixQuery. A wildcard in the middle of a string like "c*t" would
: become a prefix query for "t$c*".

That's a pretty slick trick. Considering how many Terms the index would wind up containing in order to denormalize the data in that way, I wonder if it would be more practical to index each of the characters as a separate term, with the word repeated after the "end of word" character, turning wildcard searches into phrase searches (after doing the preprocessing and rotating you described).

I.e., index "cat" as:

    c a t $ c a t

    search for "*at*" as a phrase search for "a t"
    search for "*at"  as a phrase search for "a t $"
    search for "c*t"  as a phrase search for "t $ c"

...I'm fairly certain that would keep the index size much smaller (the number of terms would be much smaller, while the average term frequency wouldn't really increase), but I'm not sure it would actually be any faster. It depends on the algorithm and performance of PhraseQuery, which is something I haven't really looked into. It could very well be significantly slower.

-Hoss
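The character-per-term idea in this message can be tried out without Lucene. The sketch below is a toy positional index, assuming each word is its own "document" and that '$' never occurs inside a word; the class CharPhraseIndex and its methods are hypothetical names, and the phrase matching is a naive stand-in for what PhraseQuery would do, so it says nothing about real-world speed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy positional index: every character is a term, and each word is indexed
// as word + '$' + word, e.g. "cat" -> c a t $ c a t.
class CharPhraseIndex {
    // character -> (word -> positions of that character in the word's token stream)
    private final Map<Character, Map<String, List<Integer>>> postings = new HashMap<>();

    public void add(String word) {
        String stream = word + "$" + word;
        for (int pos = 0; pos < stream.length(); pos++) {
            postings.computeIfAbsent(stream.charAt(pos), k -> new HashMap<>())
                    .computeIfAbsent(word, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Phrase search: the characters of 'phrase' must sit at consecutive positions.
    public List<String> phrase(String phrase) {
        TreeSet<String> hits = new TreeSet<>();
        if (phrase.isEmpty()) {
            return new ArrayList<>(hits);
        }
        Map<String, List<Integer>> first = postings.getOrDefault(
                phrase.charAt(0), Collections.<String, List<Integer>>emptyMap());
        for (Map.Entry<String, List<Integer>> e : first.entrySet()) {
            for (int start : e.getValue()) {
                if (matchesAt(e.getKey(), start, phrase)) {
                    hits.add(e.getKey());
                    break;
                }
            }
        }
        return new ArrayList<>(hits);
    }

    private boolean matchesAt(String word, int start, String phrase) {
        for (int i = 1; i < phrase.length(); i++) {
            Map<String, List<Integer>> byWord = postings.get(phrase.charAt(i));
            if (byWord == null) return false;
            List<Integer> positions = byWord.get(word);
            if (positions == null || !positions.contains(start + i)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        CharPhraseIndex index = new CharPhraseIndex();
        for (String w : new String[] {"cat", "cot", "bat", "attic"}) index.add(w);
        System.out.println(index.phrase("at"));   // "*at*" -> [attic, bat, cat]
        System.out.println(index.phrase("at$"));  // "*at"  -> [bat, cat]
        System.out.println(index.phrase("t$c"));  // "c*t"  -> [cat, cot]
    }
}
```

The index holds only one term per distinct character plus '$', which is the size advantage suggested above; the cost moves into the position checks done at query time.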
Re: Starts With x and Ends With x Queries
On Feb 4, 2005, at 9:37 PM, Chris Hostetter wrote:

> If you want to start doing suffix queries (i.e., all names ending with "s", or all names ending with "Smith"), one approach would be to use WildcardQuery, which, as Erik mentioned, will allow you to use a query Term that starts with a "*", i.e.:
>
>     Query q3 = new WildcardQuery(new Term("name", "*s"));
>     Query q4 = new WildcardQuery(new Term("name", "*Smith"));
>
> (NOTE: Erik says you can do this, but the docs for WildcardQuery say you can't. I'll assume the docs are wrong and Erik is correct.)

I assume you mean this comment in WildcardQuery's javadocs:

"In order to prevent extremely slow WildcardQueries, a Wildcard term must not start with one of the wildcards * or ?."

I don't read that as saying you cannot use an initial wildcard character, but rather that if you use a leading wildcard character you risk performance issues. I'm going to change "must" to "should". And yes, WildcardQuery itself supports a leading wildcard character exactly as you have shown.

> Which leads me to my point: if you denormalize your data so that you store both the Term you want, and the *reverse* of the term you want, then a Suffix query is just a Prefix query on a reversed field -- by sacrificing space, you can get all the speed efficiencies of a PrefixQuery when doing a SuffixQuery...
>
>     D1> name:"Adam Smith" rname:"htimS madA" age:13 state:CA ...
>     D2> name:"Joe Bob" rname:"boB oeJ" age:42 state:WA ...
>     D3> name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ...
>     D4> name:"Sue Smith" rname:"htimS euS" age:33 state:CA ...
>
>     Query q1 = new PrefixQuery(new Term("name", "J"));
>     Query q2 = new PrefixQuery(new Term("name", "Sue"));
>     Query q3 = new PrefixQuery(new Term("rname", "s"));
>     Query q4 = new PrefixQuery(new Term("rname", "htimS"));
>
> (If anyone sees a flaw in my theory, please chime in)

This trick has been mentioned on this list before, and is a good one.

I'll go one step further and mention another technique I found in the book Managing Gigabytes, making "*string*" queries drastically more efficient for searching (though also impacting index size). Take the term "cat". It would be indexed with all rotated variations with an end of word marker added:

    cat$
    at$c
    t$ca
    $cat

The query for "*at*" would be preprocessed and rotated such that the wildcards are collapsed at the end to search for "at*" as a PrefixQuery. A wildcard in the middle of a string like "c*t" would become a prefix query for "t$c*".

Has anyone tried this technique with Lucene?

Erik
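A rough way to experiment with the rotation technique outside Lucene is a sorted map of rotations. The sketch below is an assumption-laden toy: RotatedTermIndex is a hypothetical name, a TreeMap stands in for the sorted term dictionary, '$' must not occur inside a term, and only patterns of the forms "*x*", "*x", "x*" and "x*y" are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy rotated-term index: every rotation of term + '$' is stored in sorted
// order, so each supported wildcard pattern becomes a single prefix scan.
class RotatedTermIndex {
    private static final char END = '$';
    // rotation -> original term; a rotation holds exactly one '$', so it
    // identifies its term uniquely
    private final TreeMap<String, String> rotations = new TreeMap<>();

    public void add(String term) {
        String marked = term + END;                  // "cat" -> "cat$"
        for (int i = 0; i < marked.length(); i++) {  // cat$, at$c, t$ca, $cat
            rotations.put(marked.substring(i) + marked.substring(0, i), term);
        }
    }

    // Rotates the pattern so that the wildcard ends up at the end.
    static String toPrefix(String pattern) {
        if (pattern.length() >= 2 && pattern.startsWith("*") && pattern.endsWith("*")) {
            return pattern.substring(1, pattern.length() - 1);          // *at* -> at
        }
        int star = pattern.indexOf('*');
        if (star < 0) {
            throw new IllegalArgumentException("pattern needs a '*': " + pattern);
        }
        // c*t -> t$c, *at -> at$, ca* -> $ca
        return pattern.substring(star + 1) + END + pattern.substring(0, star);
    }

    public List<String> search(String pattern) {
        String prefix = toPrefix(pattern);
        TreeSet<String> hits = new TreeSet<>();
        // Seek to the first rotation >= prefix and stop at the first non-match.
        for (Map.Entry<String, String> e : rotations.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            hits.add(e.getValue());
        }
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        RotatedTermIndex index = new RotatedTermIndex();
        for (String t : new String[] {"cat", "cot", "bat", "attic", "cart"}) index.add(t);
        System.out.println(index.search("*at*")); // [attic, bat, cat]
        System.out.println(index.search("c*t"));  // [cart, cat, cot]
        System.out.println(index.search("*at"));  // [bat, cat]
    }
}
```

A term of length n contributes n + 1 entries, which is the index-size cost mentioned above.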
Re: Starts With x and Ends With x Queries
I sent this to the wrong address. Sorry.

Peter Pimley wrote:

> Well done.
Re: Starts With x and Ends With x Queries
Well done.

I was so annoyed with the humiliation-for-kicks this afternoon that I just practised my self-destruction techniques with some friends this evening ;)

As for configuration, java.lang.System.getenv will give you access to an environment variable.

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html
Re: Starts With x and Ends With x Queries
: Also keep in mind that QueryParser only allows a trailing asterisk,
: creating a PrefixQuery. However, if you use a WildcardQuery directly,
: you can use an asterisk as the starting character (at the risk of
: performance).

On the issue of "ends with" wildcard queries, I wanted to throw out an idea that I've seen used to deal with matches like this in other systems. I've never actually tried this with Lucene, but I've seen it used effectively in other systems where the goal is to "sort" strings by the least significant (i.e., rightmost) characters first. I think it could apply nicely to people who have compelling needs for efficient "ends with" queries.

Imagine you have a field called name, on which you can already do efficient prefix matching using the PrefixQuery class. Your docs may look something like this...

    D1> name:"Adam Smith" age:13 state:CA ...
    D2> name:"Joe Bob" age:42 state:WA ...
    D3> name:"John Adams" age:35 state:NV ...
    D4> name:"Sue Smith" age:33 state:CA ...

...and your queries may look something like...

    // PrefixQuery takes the bare prefix; no trailing "*" is needed
    Query q1 = new PrefixQuery(new Term("name", "J"));
    Query q2 = new PrefixQuery(new Term("name", "Sue"));

If you want to start doing suffix queries (i.e., all names ending with "s", or all names ending with "Smith"), one approach would be to use WildcardQuery, which, as Erik mentioned, will allow you to use a query Term that starts with a "*", i.e.:

    Query q3 = new WildcardQuery(new Term("name", "*s"));
    Query q4 = new WildcardQuery(new Term("name", "*Smith"));

(NOTE: Erik says you can do this, but the docs for WildcardQuery say you can't. I'll assume the docs are wrong and Erik is correct.)

The problem is that this is horrendously inefficient. In order to find the docs that contain Terms which match your suffix, WildcardQuery must first identify what all of those Terms are, by iterating over every Term in your index to see if it matches the suffix. This is much slower than a PrefixQuery, or even a WildcardQuery that has just one initial character before a "*" (i.e., "s*foobar"), because those can seek directly to the first Term that starts with that character, and stop iterating as soon as they encounter a Term that no longer begins with that character.

Which leads me to my point: if you denormalize your data so that you store both the Term you want, and the *reverse* of the term you want, then a Suffix query is just a Prefix query on a reversed field -- by sacrificing space, you can get all the speed efficiencies of a PrefixQuery when doing a SuffixQuery...

    D1> name:"Adam Smith" rname:"htimS madA" age:13 state:CA ...
    D2> name:"Joe Bob" rname:"boB oeJ" age:42 state:WA ...
    D3> name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ...
    D4> name:"Sue Smith" rname:"htimS euS" age:33 state:CA ...

    Query q1 = new PrefixQuery(new Term("name", "J"));
    Query q2 = new PrefixQuery(new Term("name", "Sue"));
    Query q3 = new PrefixQuery(new Term("rname", "s"));      // names ending with "s"
    Query q4 = new PrefixQuery(new Term("rname", "htimS"));  // names ending with "Smith"

(If anyone sees a flaw in my theory, please chime in)

-Hoss
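The reversed-field idea can be checked in isolation with a sorted set standing in for the term dictionary. This is a minimal sketch under stated assumptions: ReversedTermIndex is a hypothetical name, each name is treated as a single untokenized term, and a TreeSet only approximates how a real index seeks through its sorted terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy "rname" field: terms are stored reversed in sorted order, so an
// "ends with" lookup becomes a seek plus a short scan.
class ReversedTermIndex {
    private final TreeSet<String> reversedTerms = new TreeSet<>();

    private static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public void add(String term) {
        reversedTerms.add(reverse(term));            // "Adam Smith" -> "htimS madA"
    }

    public List<String> endingWith(String suffix) {
        String prefix = reverse(suffix);             // "Smith" -> "htimS"
        List<String> hits = new ArrayList<>();
        // Jump to the first reversed term >= prefix, then stop at the first
        // term that no longer starts with it; no full scan of the set.
        for (String r : reversedTerms.tailSet(prefix, true)) {
            if (!r.startsWith(prefix)) break;
            hits.add(reverse(r));
        }
        return hits;
    }

    public static void main(String[] args) {
        ReversedTermIndex index = new ReversedTermIndex();
        index.add("Adam Smith");
        index.add("Joe Bob");
        index.add("John Adams");
        index.add("Sue Smith");
        System.out.println(index.endingWith("Smith")); // [Sue Smith, Adam Smith]
        System.out.println(index.endingWith("s"));     // [John Adams]
    }
}
```

Results come back in the sort order of the reversed strings, which is why "Sue Smith" precedes "Adam Smith". The space cost is one extra stored term per original term.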
Re: Starts With x and Ends With x Queries
It matches both because you're tokenizing the name field. In both documents, the name field has a "testing" term in it (it gets lowercased also). A PrefixQuery matches terms that start with the prefix. Use an untokenized field type (Field.Keyword) if you want to keep the entire original string as-is for searching purposes; however, you'd have issues with case sensitivity in your example.

Also keep in mind that QueryParser only allows a trailing asterisk, creating a PrefixQuery. However, if you use a WildcardQuery directly, you can use an asterisk as the starting character (at the risk of performance).

Erik

On Feb 4, 2005, at 7:50 PM, Luke Shannon wrote:

> Hello;
>
> I have these two documents:
>
> [document listings lost in archiving; only the field types Text and Keyword survive]
>
> I would like to be able to match name fields that start with "testing" (specifically) and those that end with it. I thought the code below would parse to a PrefixQuery that would satisfy my "starts with" requirement (maybe I don't understand what this query is for). But it matches both.
>
>     Query query = QueryParser.parse("testing*", "name", new StandardAnalyzer());
>
> Has anyone done this before? Any tips?
>
> Thanks,
>
> Luke
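The difference between a tokenized and an untokenized field can be shown without Lucene. In the sketch below, the class TokenizedPrefixDemo, its crude tokenizer (a stand-in for an analyzer), and the two sample values are all assumptions made for illustration; the original documents' field values are not available.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TokenizedPrefixDemo {
    // Crude analyzer stand-in: lowercase, then split on non-alphanumerics.
    static List<String> tokenize(String value) {
        List<String> terms = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    // Tokenized field: the prefix matches if ANY term starts with it.
    static boolean matchesTokenized(String value, String prefix) {
        for (String term : tokenize(value)) {
            if (term.startsWith(prefix)) return true;
        }
        return false;
    }

    // Untokenized (keyword-style) field: the whole value is one term,
    // with its original case preserved.
    static boolean matchesKeyword(String value, String prefix) {
        return value.startsWith(prefix);
    }

    public static void main(String[] args) {
        String startsWithIt = "Testing the index"; // hypothetical field value
        String endsWithIt = "Index testing";       // hypothetical field value

        // Tokenized: both values contain a "testing" term, so both match.
        System.out.println(matchesTokenized(startsWithIt, "testing")); // true
        System.out.println(matchesTokenized(endsWithIt, "testing"));   // true

        // Keyword: only a value that really starts with the prefix matches,
        // and the comparison is case-sensitive.
        System.out.println(matchesKeyword(endsWithIt, "testing"));     // false
        System.out.println(matchesKeyword(startsWithIt, "testing"));   // false
        System.out.println(matchesKeyword(startsWithIt, "Testing"));   // true
    }
}
```

One way around the case problem in the last two lines is to lowercase the value before storing it as a keyword and to lowercase the prefix at query time.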
Starts With x and Ends With x Queries
Hello;

I have these two documents:

[document listings lost in archiving; only the field types Text and Keyword survive]

I would like to be able to match name fields that start with "testing" (specifically) and those that end with it. I thought the code below would parse to a PrefixQuery that would satisfy my "starts with" requirement (maybe I don't understand what this query is for). But it matches both.

    Query query = QueryParser.parse("testing*", "name", new StandardAnalyzer());

Has anyone done this before? Any tips?

Thanks,

Luke