Re: MultiFieldQueryParser 1.8 isn't parsing phrases

2005-02-19 Thread Ben
Thanks


On Sat, 19 Feb 2005 16:09:49 +0100, Daniel Naber
<[EMAIL PROTECTED]> wrote:
> On Saturday 19 February 2005 15:26, Ben wrote:
> 
> > When I try to search for phrases using the MultiFieldQueryParser v1.8
> > from CVS, it gives me NullPointerException.
> 
> This has just been fixed in SVN (I assume you mean SVN, CVS still exists
> but is read only and probably not updated anymore).
> 
> Regards
>  Daniel
> 
> --
> http://www.danielnaber.de
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>




Re: MultiFieldQueryParser 1.8 isn't parsing phrases

2005-02-19 Thread Daniel Naber
On Saturday 19 February 2005 15:26, Ben wrote:

> When I try to search for phrases using the MultiFieldQueryParser v1.8
> from CVS, it gives me NullPointerException.

This has just been fixed in SVN (I assume you mean SVN, CVS still exists 
but is read only and probably not updated anymore).

Regards
 Daniel

-- 
http://www.danielnaber.de




MultiFieldQueryParser 1.8 isn't parsing phrases

2005-02-19 Thread Ben
Hi

When I try to search for phrases using the MultiFieldQueryParser v1.8
from CVS, it gives me NullPointerException.

Using the following keyword works:

title:"IBM backs linux"

However, it gives me the exception if I use the following keyword:

"IBM backs linux"

Any idea why? I am using this MultiFieldQueryParser with Lucene 1.4.3.
Of course, I changed some of the boolean-related code to make it work with
the production release.

Thanks,
Ben




RE: analyzer effecting phrases?

2004-12-23 Thread Chris Hostetter
: Therefore I turned back to the standard analyzer and now do some replacing
: of the underscores in my ID string to avoid my original problem. This solved

maybe i'm missing something, but if you've got a field in your doc that
represents an ID, why not create that field as "NonTokenized" so you don't
have to worry about what characters the analyzer you're using thinks are
special?


-Hoss





RE: Stopwords in phrases

2004-12-21 Thread Ravi

> Are you also using the position increment of 0 for the "gram" tokens
> like Nutch does?

Yes.

I don't think considering only "gram" tokens will work for me because
Nutch uses only bi-grams. It can only have one gram per token. In my
case I have more than one and even if I get only the grams, I still will
have the same problem. 

Ravi.






Re: Stopwords in phrases

2004-12-21 Thread Erik Hatcher
On Dec 21, 2004, at 10:41 AM, Ravi wrote:
> I want to be able to use stopwords in exact phrase searches. I have
> looked at Nutch and used the same approach (replace common words with
> n-grams. Look at net.nutch.analysis.CommonGrams).
> So if "to","be","or" and "not" are stop words, for the string "to be
> or not to be", the analyzer produces the following tokens
> [to-be, to-be-or, to-be-or-not, to-be-or-not-to, to-be-or-not-to-be,
> be-or, be-or-not, be-or-not-to, be-or-not-to-be, or-not, or-not-to,
> or-not-to-be, not-to, not-to-be, to-be]

You've gone a bit beyond what Nutch is using.  It creates bigrams,
where you've expanded it to many more than that.

Are you also using the position increment of 0 for the "gram" tokens
like Nutch does?

> But I'm having a problem with the search.
> When I do a search on "not to be" the analyzer is converting my search
> into content:"not-to not-to-be to-be" because the analyzer produces the
> tokens "not-to","not-to-be","to-be"
> I'm getting 0 results on this as there is no token "not-to not-to-be
> to-be" in the index.
> I want just "not-to-be" from the analyzer during the search so when I
> search on "not to be" I will get the document which has "not-to-be" as
> a token.
>
> How can I use the same analyzer to get different results in indexing
> and searching?

Nutch does some different stuff between indexing and parsing queries...
 [java] 1: [the:] [the-quick:gram]
 [java] 2: [quick:]
 [java] 3: [brown:]
 [java] 4: [fox:]
 [java] query = (+url:"the quick brown"^4.0) (+anchor:"the quick 
brown"^2.0) (+content:"the-quick quick brown")

The first four lines show the analysis of "the quick brown fox".  The 
last line is the resultant Lucene query for "the quick brown".  Notice 
that only the "content" field gets analyzed specially, and also that 
only "gram" tokens are considered in that field, not the  tokens 
if there is also a "gram".
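The behaviour described above can be sketched in plain Java. This is only an illustration, not the Nutch or Lucene API: the class name, the list of common words, and the rule that a gram is built only when the first word is common are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; names and rules here are hypothetical.
class CommonGramsSketch {
    // Assumed list of common words; the real list lives in the Nutch configuration.
    static final Set<String> COMMON = new HashSet<>(Arrays.asList("the", "a", "to", "of"));

    // Index side: one inner list per position. The plain word comes first and,
    // when the word is common and has a successor, a "word-next" gram is stacked
    // at the same position (the equivalent of a position increment of 0).
    static List<List<String>> indexTokens(String text) {
        String[] words = text.trim().toLowerCase().split("\\s+");
        List<List<String>> positions = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            List<String> here = new ArrayList<>();
            here.add(words[i]);
            if (COMMON.contains(words[i]) && i + 1 < words.length) {
                here.add(words[i] + "-" + words[i + 1]);
            }
            positions.add(here);
        }
        return positions;
    }

    // Query side: where a gram exists at a position, use it instead of the plain word.
    static List<String> queryTerms(String phrase) {
        List<String> terms = new ArrayList<>();
        for (List<String> here : indexTokens(phrase)) {
            terms.add(here.get(here.size() - 1));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(indexTokens("the quick brown fox"));
        System.out.println(queryTerms("the quick brown"));
    }
}
```

With these assumptions, the query terms for "the quick brown" come out as the-quick, quick, brown, which lines up with the content:"the-quick quick brown" clause in the output above.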

Does this help with your situation?
Erik


Stopwords in phrases

2004-12-21 Thread Ravi
 I want to be able to use stopwords in exact phrase searches. I have
looked at Nutch and used the same approach (replace common words with
n-grams. Look at net.nutch.analysis.CommonGrams). 
  So if "to","be","or" and "not" are stop words, for the string "to be
or not to be", the analyzer produces the following tokens

[to-be, to-be-or, to-be-or-not, to-be-or-not-to, to-be-or-not-to-be,
be-or, be-or-not, be-or-not-to, be-or-not-to-be, or-not, or-not-to,
or-not-to-be, not-to, not-to-be, to-be]

This is exactly what I wanted from the analyzer during indexing.
But I'm having a problem with the search.
When I do a search on "not to be" the analyzer is converting my search
into
  content:"not-to not-to-be to-be"
because the analyzer produces the tokens "not-to","not-to-be","to-be".

I'm getting 0 results on this as there is no token "not-to not-to-be
to-be" in the index.

I want just "not-to-be" from the analyzer during the search so when I
search on "not to be" I will get the document which has "not-to-be" as a
token.

   How can I use the same analyzer to get different results in indexing
and searching? 
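
The index-time and search-time behaviour asked for here can be sketched in plain Java. This is a rough illustration, not Lucene analyzer code: the class and method names are hypothetical, and dropping a stop word that has no stop-word neighbour is an assumption read off the token list above. In Lucene itself the usual route is to give the query parser a different Analyzer from the one used by the IndexWriter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; not a Lucene Analyzer.
class StopGrams {
    static final Set<String> STOP = new HashSet<>(Arrays.asList("to", "be", "or", "not"));

    static String[] words(String text) {
        String t = text.trim().toLowerCase();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    // Index side: for every stop word, emit each gram that starts there and
    // extends over the following stop words (2 words, 3 words, ...).
    static List<String> indexGrams(String text) {
        String[] w = words(text);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < w.length; i++) {
            if (!STOP.contains(w[i])) { out.add(w[i]); continue; }
            StringBuilder gram = new StringBuilder(w[i]);
            for (int j = i + 1; j < w.length && STOP.contains(w[j]); j++) {
                gram.append('-').append(w[j]);
                out.add(gram.toString());
            }
        }
        return out;
    }

    // Search side: for each run of consecutive stop words, emit only the
    // single longest gram covering the whole run.
    static List<String> queryGrams(String text) {
        String[] w = words(text);
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < w.length) {
            if (!STOP.contains(w[i])) { out.add(w[i]); i++; continue; }
            int j = i;
            StringBuilder gram = new StringBuilder(w[i]);
            while (j + 1 < w.length && STOP.contains(w[j + 1])) {
                j++;
                gram.append('-').append(w[j]);
            }
            if (j > i) out.add(gram.toString()); // a lone stop word is dropped (assumption)
            i = j + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(indexGrams("to be or not to be"));
        System.out.println(queryGrams("not to be"));
    }
}
```

Under these assumptions the search side yields the single term not-to-be for "not to be", which is one of the 15 tokens produced on the index side.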

Thanks in advance,
Ravi. 




Re: analyzer effecting phrases?

2004-12-20 Thread Erik Hatcher
On Dec 20, 2004, at 12:43 PM, Peter Posselt Vestergaard wrote:
> Therefore I turned back to the standard analyzer and now do some replacing
> of the underscores in my ID string to avoid my original problem. This solved
> my phrase problem so that I can now search for phrases. However I still have
> the problem with ",.:" described above. As far as I can see the
> StandardAnalyzer (the StandardTokenizer that is) should tokenize words
> without the ",.:" characters. Am I mistaken? Is there a tokenizer that will
> do this?

StandardAnalyzer does tokenize without ",.:", though it will keep 
domain names together.  Here's an example:

$ ant -emacs AnalyzerDemo
Buildfile: build.xml
AnalyzerDemo:
  Demonstrates analysis of sample text.
  Refer to the "Analysis" chapter for much more on this
  extremely crucial topic.
Press return to continue...
String to analyze: [This string will be analyzed.]
Example with commas, colons, and dots.  You can get this code from 
http://www.lucenebook.com
Running lia.analysis.AnalyzerDemo...
Analyzing "Example with commas, colons, and dots.  You can get this 
code from http://www.lucenebook.com"
  WhitespaceAnalyzer:
[Example] [with] [commas,] [colons,] [and] [dots.] [You] [can] 
[get] [this] [code] [from] [http://www.lucenebook.com]

  SimpleAnalyzer:
[example] [with] [commas] [colons] [and] [dots] [you] [can] [get] 
[this] [code] [from] [http] [www] [lucenebook] [com]

  StopAnalyzer:
[example] [commas] [colons] [dots] [you] [can] [get] [code] [from] 
[http] [www] [lucenebook] [com]

  StandardAnalyzer:
[example] [commas] [colons] [dots] [you] [can] [get] [code] [from] 
[http] [www.lucenebook.com]


BUILD SUCCESSFUL
Total time: 7 seconds


RE: analyzer effecting phrases?

2004-12-20 Thread Peter Posselt Vestergaard
Hi again
Thanks for your answer, Otis. My analyzer does nothing more than apply
the WhitespaceAnalyzer/LowerCaseFilter.
However I found out that I got problems with characters such as ",.:" when
searching because of my simple analyzer. (E.g. I would not be able to search
for "world" in the string "Hello world." as . became part of the last word).
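
A small plain-Java sketch of the effect described here, assuming whitespace splitting on one side and splitting on non-letters on the other. The class and method names are hypothetical and this is not Lucene tokenizer code.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; not the Lucene tokenizers.
class TokenizeSketch {
    // Split on whitespace and lowercase: punctuation stays glued to the word.
    static List<String> whitespaceLower(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t.toLowerCase());
        }
        return out;
    }

    // Split on anything that is not a letter and lowercase: punctuation is discarded.
    static List<String> lettersLower(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) out.add(t.toLowerCase());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(whitespaceLower("Hello world.")); // "world." keeps its dot
        System.out.println(lettersLower("Hello world."));    // "world" without the dot
    }
}
```

Note that splitting on non-letters would also break up IDs containing underscores, so an ID field would need separate handling.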

Therefore I turned back to the standard analyzer and now do some replacing
of the underscores in my ID string to avoid my original problem. This solved
my phrase problem so that I can now search for phrases. However I still have
the problem with ",.:" described above. As far as I can see the
StandardAnalyzer (the StandardTokenizer that is) should tokenize words
without the ",.:" characters. Am I mistaken? Is there a tokenizer that will
do this?
Thanks for the help!
Regards
Peter

> Date: Mon, 20 Dec 2004 08:19:42 -0800 (PST)
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> Subject: analyzer effecting phrases?
> Content-Type: text/plain; charset=us-ascii
> 
> 
> When searching for phrases, what's important is the position of each
> token/word extracted by the Analyzer. 
> WhitespaceAnalyzer/LowerCaseFilter don't do anything with the
> positional information.  There is nothing else in your Analyzer?
> 
> In any case, the following should help you see what your Analyzer is
> doing:
> http://wiki.apache.org/jakarta-lucene/AnalysisParalysis and you can
> augment the code there to provide positional information, too.
> 
> Otis
> 
> -----Original Message-----
> From: Peter Posselt Vestergaard [mailto:[EMAIL PROTECTED] 
> Sent: 20. december 2004 15:24
> To: '[EMAIL PROTECTED]'
> Subject: analyzer effecting phrases?
> 
> Hi
> I am building an index of texts, each related to a unique id. 
> The unique ids might contain a number of underscores which 
> will make the standardanalyzer shorten them after it sees the 
> second underscore in a row. Furthermore many of the texts I 
> am indexing are in Italian so the removal of 'trivial' words 
> done by the standard analyzer is not necessarily meaningful 
> for these texts. Therefore I am instead using an analyzer 
> made from the WhitespaceTokenizer and the LowerCaseFilter.
> This works fine for me until I try searching for a phrase. I 
> am searching for a simple phrase containing two words and 
> with double-quotes around it. I have found the phrase in one 
> of the texts so I know it should return at least one result, 
> but none is found. If I remove the double-quotes and search 
> for the 2 words with AND between them I do find the story.
> Can anyone tell me if this is an obvious (side-)effect of not 
> using the standard analyzer? And is there a better solution 
> to my problem than using the very simple analyzer?
> Best regards
> Peter Vestergaard
> PS: I use the same analyzer for both searching and indexing 
> (of course).




Re: analyzer effecting phrases?

2004-12-20 Thread Otis Gospodnetic
When searching for phrases, what's important is the position of each
token/word extracted by the Analyzer. 
WhitespaceAnalyzer/LowerCaseFilter don't do anything with the
positional information.  There is nothing else in your Analyzer?

In any case, the following should help you see what your Analyzer is
doing:
http://wiki.apache.org/jakarta-lucene/AnalysisParalysis and you can
augment the code there to provide positional information, too.
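
To illustrate why positions matter for phrases, here is a minimal plain-Java sketch of a positional lookup. It is not Lucene code; the class and method names are hypothetical, and it assumes the tokens have already been produced by some analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; not how Lucene stores its index.
class PhrasePositionsSketch {
    // Record, for each token, the positions at which it occurs.
    static Map<String, List<Integer>> index(List<String> tokens) {
        Map<String, List<Integer>> postings = new HashMap<>();
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), k -> new ArrayList<>()).add(pos);
        }
        return postings;
    }

    // A phrase matches only if its terms occur at consecutive positions.
    static boolean containsPhrase(Map<String, List<Integer>> postings, List<String> phrase) {
        if (phrase.isEmpty()) return false;
        List<Integer> starts = postings.get(phrase.get(0));
        if (starts == null) return false;
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < phrase.size() && ok; i++) {
                List<Integer> p = postings.get(phrase.get(i));
                ok = p != null && p.contains(start + i);
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> idx = index(Arrays.asList("ciao", "bel", "mondo"));
        System.out.println(containsPhrase(idx, Arrays.asList("bel", "mondo")));  // adjacent terms
        System.out.println(containsPhrase(idx, Arrays.asList("ciao", "mondo"))); // both present, not adjacent
    }
}
```

An AND query only needs both terms to be present, while the phrase check also needs the recorded positions to line up, so anything that changes tokens or positions between indexing and searching can make a phrase fail while AND still succeeds.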

Otis

--- Peter Posselt Vestergaard <[EMAIL PROTECTED]> wrote:

> Hi
> I am building an index of texts, each related to a unique id. The unique
> ids might contain a number of underscores which will make the
> standardanalyzer shorten them after it sees the second underscore in a
> row. Furthermore many of the texts I am indexing are in Italian so the
> removal of 'trivial' words done by the standard analyzer is not
> necessarily meaningful for these texts. Therefore I am instead using an
> analyzer made from the WhitespaceTokenizer and the LowerCaseFilter.
> This works fine for me until I try searching for a phrase. I am searching
> for a simple phrase containing two words and with double-quotes around
> it. I have found the phrase in one of the texts so I know it should
> return at least one result, but none is found. If I remove the
> double-quotes and search for the 2 words with AND between them I do find
> the story.
> Can anyone tell me if this is an obvious (side-)effect of not using the
> standard analyzer? And is there a better solution to my problem than
> using the very simple analyzer?
> Best regards
> Peter Vestergaard
> PS: I use the same analyzer for both searching and indexing (of course).
> 





analyzer effecting phrases?

2004-12-20 Thread Peter Posselt Vestergaard
Hi
I am building an index of texts, each related to a unique id. The unique ids
might contain a number of underscores which will make the standardanalyzer
shorten them after it sees the second underscore in a row. Furthermore many
of the texts I am indexing are in Italian so the removal of 'trivial' words
done by the standard analyzer is not necessarily meaningful for these texts.
Therefore I am instead using an analyzer made from the WhitespaceTokenizer
and the LowerCaseFilter.
This works fine for me until I try searching for a phrase. I am searching
for a simple phrase containing two words and with double-quotes around it. I
have found the phrase in one of the texts so I know it should return at
least one result, but none is found. If I remove the double-quotes and
search for the 2 words with AND between them I do find the story.
Can anyone tell me if this is an obvious (side-)effect of not using the
standard analyzer? And is there a better solution to my problem than using
the very simple analyzer?
Best regards
Peter Vestergaard
PS: I use the same analyzer for both searching and indexing (of course).




Re: phrases

2004-03-16 Thread Erik Hatcher
Try setting the slop factor on your phrase query.  This should 
accomplish what you want.  Set it to something like 10 and see what you 
get.
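
In Lucene's query parser syntax a slop of 10 is written "Georgian Hotel"~10, and on a PhraseQuery object the corresponding call is setSlop(10). The plain-Java sketch below only illustrates the idea for two terms in order; it is not Lucene's matching code. The class and method names are hypothetical, and the 1 / (gap + 1) weighting is an assumption about how closer matches can be favoured.

```java
// Plain-Java illustration only; simplified to two terms that appear in order.
class SlopSketch {
    // Smallest number of tokens sitting between `first` and a later `second`,
    // or -1 if `second` never follows `first`.
    static int gap(String text, String first, String second) {
        String[] t = text.trim().toLowerCase().split("\\s+");
        int best = -1;
        int lastFirst = -1;
        for (int i = 0; i < t.length; i++) {
            if (t[i].equals(first)) {
                lastFirst = i;
            } else if (t[i].equals(second) && lastFirst >= 0) {
                int g = i - lastFirst - 1;
                if (best < 0 || g < best) best = g;
            }
        }
        return best;
    }

    // A sloppy phrase matches when the gap does not exceed the allowed slop.
    static boolean matches(String text, String first, String second, int slop) {
        int g = gap(text, first, second);
        return g >= 0 && g <= slop;
    }

    // Assumed weighting: the smaller the gap, the larger the contribution.
    static float weight(int gap) {
        return 1.0f / (gap + 1);
    }

    public static void main(String[] args) {
        String[] names = {"Georgian Hotel", "The Georgian House Hotel", "Georgian blah blee bloo Hotel"};
        for (String name : names) {
            int g = gap(name, "georgian", "hotel");
            System.out.println(name + " -> gap " + g + ", weight " + weight(g));
        }
    }
}
```

With a slop of 10 all three names match, and under the assumed weighting the exact phrase ranks first, then "The Georgian House Hotel", then "Georgian blah blee bloo Hotel".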

	Erik

On Mar 16, 2004, at 8:55 PM, Supun Edirisinghe wrote:

> I have a field called businessname and this field contains keywords like
> "Georgian House" "Georgian" "The Georgian House Hotel" "Georgian blah
> blee bloo Hotel" along with 10,000s of other documents that have the
> word 'Hotel' somewhere in the businessname field.
>
> When I do a phrase query on "Georgian Hotel" I get only the one document
> back. I would like to get that one back as the top result but also the
> other stuff that has "Georgian" and "Hotel" too. Also, I'd like to have
> "Georgian House Hotel" show up before "Georgian blah blee bloo Hotel"
>
> Right now I do an or'd boolean query with each of the words in the
> search string as a Term in business name, as well as the entire search
> string as an exact PhraseQuery, and boost that by 3.
> But this doesn't allow me to ensure that "The Georgian House Hotel" will
> come before "Georgian blah blee bloo Hotel". There are other fields
> queried besides business name, and in my instance of the index
> "Georgian blah blee bloo Hotel" comes out with a better score because of
> those other fields. I would like the closeness of the phrase to be taken
> into account. Any ideas on constructing a good query for this situation?
>
> thanks

thanks



phrases

2004-03-16 Thread Supun Edirisinghe
I have a field called businessname and this field contains keywords like
"Georgian House" "Georgian" "The Georgian House Hotel" "Georgian blah
blee bloo Hotel" along with 10,000s of other documents that have the
word 'Hotel' somewhere in the businessname field.

When I do a phrase query on "Georgian Hotel" I get only the one document
back. I would like to get that one back as the top result but also the
other stuff that has "Georgian" and "Hotel" too. Also, I'd like to have
"Georgian House Hotel" show up before "Georgian blah blee bloo Hotel"

Right now I do an or'd boolean query with each of the words in the
search string as a Term in business name, as well as the entire search
string as an exact PhraseQuery, and boost that by 3.

But this doesn't allow me to ensure that "The Georgian House Hotel" will
come before "Georgian blah blee bloo Hotel". There are other fields
queried besides business name, and in my instance of the index
"Georgian blah blee bloo Hotel" comes out with a better score because of
those other fields. I would like the closeness of the phrase to be taken
into account. Any ideas on constructing a good query for this situation?


thanks





Re: MultiFieldQueryParser & Phrases Problem

2003-09-21 Thread Niall Lennon
Cheers for that Erik.


> From: Erik Hatcher <[EMAIL PROTECTED]>
> Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Subject: Re: MultiFieldQueryParser & Phrases Problem
> Date: Sun, 21 Sep 2003 18:46:38 -0400
>
> StandardAnalyzer removes stop words and "a" is one of them.  That is why
> you have issues with that phrase.
>
> 	Erik



Re: MultiFieldQueryParser & Phrases Problem

2003-09-21 Thread Erik Hatcher
StandardAnalyzer removes stop words and "a" is one of them.  That is 
why you have issues with that phrase.
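
A plain-Java sketch of the effect, under stated assumptions: the stop list below is a small illustrative subset rather than Lucene's actual list, the class and method names are hypothetical, and the rule that a single surviving term is printed as a plain term rather than a quoted phrase is inferred from the output quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; not the Lucene query parser.
class StopPhraseSketch {
    // Illustrative subset of English stop words (assumption).
    static final Set<String> STOP = new HashSet<>(Arrays.asList("a", "an", "the", "and", "of"));

    // Lowercase the phrase, drop stop words, and render what is left for one field.
    static String toQuery(String field, String phrase) {
        List<String> kept = new ArrayList<>();
        for (String w : phrase.trim().toLowerCase().split("\\s+")) {
            if (!w.isEmpty() && !STOP.contains(w)) kept.add(w);
        }
        if (kept.isEmpty()) return "";
        if (kept.size() == 1) return field + ":" + kept.get(0); // one term left: no phrase
        return field + ":\"" + String.join(" ", kept) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(toQuery("category_field", "Category A"));
        System.out.println(toQuery("category_field", "Category Z"));
    }
}
```

Here "Category A" collapses to the single term category, while "Category Z" stays a two-word phrase.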

	Erik

On Sunday, September 21, 2003, at 06:13  PM, Niall Lennon wrote:

> I'm currently using the MultiFieldQueryParser to search across four
> fields. I'm searching for phrases so i've wrapped my search text in
> quotes... everything worked fine until i tried to execute a search
> ending with the 'A' and for some reason the A and quotes are ignored
> e.g.:
>
> Analyzer analyzer = new StandardAnalyzer();
> Searcher searcher = new IndexSearcher(IndexReader.open("dbindex"));
> String[] fields = {"code_field", "short_description_field",
>                    "category_field", "manufacturer_field"};
> int[] flags = {MultiFieldQueryParser.NORMAL_FIELD,
>                MultiFieldQueryParser.NORMAL_FIELD,
>                MultiFieldQueryParser.NORMAL_FIELD,
>                MultiFieldQueryParser.NORMAL_FIELD};
>
> Query query = MultiFieldQueryParser.parse("\"Category A\"", fields,
>                                           flags, analyzer);
>
> System.out.println("query -> " + query);
>
> Hits hits = searcher.search(query);
>
> The System output for the above is as follows:
> code_field:category short_description_field:category
> category_field:category manufacturer_field:category
>
> If i execute the same code with the following search text i get the
> expected results:
> Query query = MultiFieldQueryParser.parse("\"Category Z\"", fields,
>                                           flags, analyzer);
>
> code_field:"category z" short_description_field:"category z"
> category_field:"category z" manufacturer_field:"category z"
>
> I'd appreciate any help with regards this matter...



MultiFieldQueryParser & Phrases Problem

2003-09-21 Thread Niall Lennon
I'm currently using the MultiFieldQueryParser to search across four fields.
I'm searching for phrases, so I've wrapped my search text in quotes.
Everything worked fine until I tried to execute a search ending with 'A';
for some reason the A and the quotes are ignored, e.g.:

Analyzer analyzer = new StandardAnalyzer();
Searcher searcher = new IndexSearcher(IndexReader.open("dbindex"));
String[] fields = {"code_field", "short_description_field",
                   "category_field", "manufacturer_field"};
int[] flags = {MultiFieldQueryParser.NORMAL_FIELD,
               MultiFieldQueryParser.NORMAL_FIELD,
               MultiFieldQueryParser.NORMAL_FIELD,
               MultiFieldQueryParser.NORMAL_FIELD};

Query query = MultiFieldQueryParser.parse("\"Category A\"", fields,
                                          flags, analyzer);

System.out.println("query -> " + query);

Hits hits = searcher.search(query);

The System output for the above is as follows:
code_field:category short_description_field:category
category_field:category manufacturer_field:category

If I execute the same code with the following search text I get the
expected results:
Query query = MultiFieldQueryParser.parse("\"Category Z\"", fields,
                                          flags, analyzer);

code_field:"category z" short_description_field:"category z"
category_field:"category z" manufacturer_field:"category z"

I'd appreciate any help with this matter...



How does Lucene handle phrases containing words that are not indexed?

2002-02-13 Thread hugo burm


How does Lucene handle phrases (literals) containing words that are not
indexed (e.g. stopwords, one-letter words, numbers)? I did some tests
(lucene demo, my own 12 xml documents, Cocoon search) and in all cases
it looks as if a search for the phrase "a specification" also finds
documents which contain "the specification" (or "D. Washington"
instead of "G. Washington").

Of course you can change the indexing behaviour and make sure there are no
stopwords, and that all one-letter words and numbers are indexed. But that
seems a bad approach. A better approach: 1) find all indexed words in the
phrase and from these words find all documents containing them; 2) check
the occurrence of the phrase by opening the original document. I am
wondering: does Lucene perform step 2)? Of course this step burns some
cpu cycles.

Hugo

[EMAIL PROTECTED]

