Re: Problems with exact matches on non-tokenized fields...

2002-11-14 Thread Stefanos Karasavvidis
> Doesn't that one do just that - treats fields differently, based on
> their name?

yes it does, but look at the question's title
"How do I write my own Analyzer?"

if someone has a problem with a non-tokenized field (which was the 
problem of the mail thread that started this) then he doesn't know that 
he has to write a custom analyzer, and so he won't be able to find the 
correct faq entry.

Moreover, the second solution Doug has proposed suits some cases better 
and should be included, too. (Doug described both solutions in a mail to 
the users list on 27/9/2002 9:24 p.m.)

I still think that there should be a faq entry as I proposed in my 
previous email.

Moreover, there should be an addition to the faq entry
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q15

it states there that it is important to use the same analyzer during 
indexing and searching. Again, this may lead to problems if a field is 
not tokenized: during indexing it will _not_ get passed through the 
analyzer, but during searching it gets passed through. If the analyzer 
does not treat that field as a special case, there will be a problem.
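
For example (a minimal sketch against the Lucene API current at the time; 
the "element"/"POST" values are taken from karl's original mail, the other 
names are made up):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.store.RAMDirectory;

  public class KeywordFieldMismatch {
    public static void main(String[] args) throws Exception {
      RAMDirectory dir = new RAMDirectory();
      StandardAnalyzer analyzer = new StandardAnalyzer();

      // indexing: Field.Keyword is stored and indexed but NOT tokenized,
      // so "POST" goes into the index verbatim, bypassing the analyzer
      IndexWriter writer = new IndexWriter(dir, analyzer, true);
      Document doc = new Document();
      doc.add(Field.Keyword("element", "POST"));
      writer.addDocument(doc);
      writer.close();

      // searching: QueryParser runs the query text through the same
      // analyzer, which lowercases "POST" to "post" -- no match
      Query query = QueryParser.parse("element:POST", "element", analyzer);
      IndexSearcher searcher = new IndexSearcher(dir);
      Hits hits = searcher.search(query);
      System.out.println("hits: " + hits.length());   // prints "hits: 0"
      searcher.close();
    }
  }

With an analyzer like the MyAnalyzer Doug shows below, or with the 
TermQuery added directly, the same search finds the document.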

I don't know, maybe I'm missing something here, but it seems obvious to 
me that non-tokenized fields in conjunction with analyzers produce 
problems which should be mentioned in the documentation/faq etc.

Stefanos

Otis Gospodnetic wrote:

Not sure which FAQ entry you are referring to.
This one http://www.jguru.com/faq/view.jsp?EID=1006122 ?

Doesn't that one do just that - treats fields differently, based on
their name?

Otis

--- Stefanos Karasavvidis <[EMAIL PROTECTED]> wrote:
 

I came across the same problem and I think that the faq entry you 
(Otis) propose should get a better title so that users can find more 
easily an answer to this problem.

Correct me if I'm wrong (and please forgive any wrong assumptions I may 
have made), but the problem is "how to query on a non-tokenized field?"

Problem explanation:
If a field is not tokenized then it is not passed through the analyzer, 
independently of the analyzer used (that's what I understand by looking 
into DocumentWriter.invertDocument()).
If you construct a query on this field with a given analyzer (for example 
with QueryParser.parse(query, field, analyzer)), the query parser does not 
know that the field is not tokenized and passes it through the analyzer. 
The analyzer may alter the query (for example if the analyzer has a 
stemming algorithm) and the document is not matched by the query.

The solution:
The solution is to make sure that fields that aren't tokenized during 
indexing are not passed through the analyzer during searching. This can 
be done in 2 ways: either by making an analyzer that takes care of this 
according to the field, or by constructing a TermQuery with this field 
and adding it to the rest of the query.

Example:
put here the 2 examples from Doug

Stefanos 



Otis Gospodnetic wrote:

   

Thanks, it's a FAQ entry now:

How do I write my own Analyzer?
http://www.jguru.com/faq/view.jsp?EID=1006122

Otis


--- Doug Cutting <[EMAIL PROTECTED]> wrote:

karl øie wrote:

I have a Lucene Document with a field named "element" which is stored
and indexed but not tokenized. The value of the field is "POST"
(uppercase). But the only way i can match the field is by entering
"element:POST?" or "element:POST*" in the QueryParser class.

There are two ways to do this.

If this must be entered by users in the query string, then you need to
use a non-lowercasing analyzer for this field.  The way to do this if
you're currently using StandardAnalyzer, is to do something like:

  public class MyAnalyzer extends Analyzer {
    private Analyzer standard = new StandardAnalyzer();
    public TokenStream tokenStream(String field, final Reader reader) {
      if ("element".equals(field)) {            // don't tokenize
        return new CharTokenizer(reader) {
          protected boolean isTokenChar(char c) { return true; }
        };
      } else {                                  // use standard analyzer
        return standard.tokenStream(field, reader);
      }
    }
  }

  Analyzer analyzer = new MyAnalyzer();
  Query query = queryParser.parse("... +element:POST", analyzer);

Alternately, if this query field is added by a program, then this can be
done by bypassing the analyzer for this clause, building it directly
instead:

  Analyzer analyzer = new StandardAnalyzer();
  BooleanQuery query = (BooleanQuery) queryParser.parse("...", analyzer);

  // now add the element clause
  query.add(new TermQuery(new Term("element", "POST")), true, false);

Perhaps this should become an FAQ...

Doug

Re: Problems with exact matches on non-tokenized fields...

2002-11-13 Thread Stefanos Karasavvidis
I came across the same problem and I think that the faq entry you 
(Otis) propose should get a better title so that users can find more 
easily an answer to this problem.

Correct me if I'm wrong (and please forgive any wrong assumptions I may 
have made), but the problem is "how to query on a non-tokenized field?"

Problem explanation:
If a field is not tokenized then it is not passed through the analyzer, 
independently of the analyzer used (that's what I understand by looking 
into DocumentWriter.invertDocument()).
If you construct a query on this field with a given analyzer (for example 
with QueryParser.parse(query, field, analyzer)), the query parser does not 
know that the field is not tokenized and passes it through the analyzer. 
The analyzer may alter the query (for example if the analyzer has a 
stemming algorithm) and the document is not matched by the query.
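
To make the alteration visible, a small snippet (not from the original 
mails; the default field name "contents" is arbitrary) that just prints 
what the parser produces when the query term names the untokenized field:

  Query q = QueryParser.parse("element:POST", "contents", new StandardAnalyzer());
  System.out.println(q.toString("contents"));   // prints: element:post

The lowercased term no longer matches the untokenized value "POST".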

The solution:
The solution is to make sure that fields that aren't tokenized during 
indexing are not passed through the analyzer during searching. This can 
be done in 2 ways: either by making an analyzer that takes care of this 
according to the field, or by constructing a TermQuery with this field 
and adding it to the rest of the query.

Example:
put here the 2 examples from Doug

Stefanos 



Otis Gospodnetic wrote:

Thanks, it's a FAQ entry now:

How do I write my own Analyzer?
http://www.jguru.com/faq/view.jsp?EID=1006122

Otis


--- Doug Cutting <[EMAIL PROTECTED]> wrote:
 

karl øie wrote:

I have a Lucene Document with a field named "element" which is stored
and indexed but not tokenized. The value of the field is "POST"
(uppercase). But the only way i can match the field is by entering
"element:POST?" or "element:POST*" in the QueryParser class.

There are two ways to do this.

If this must be entered by users in the query string, then you need to
use a non-lowercasing analyzer for this field.  The way to do this if
you're currently using StandardAnalyzer, is to do something like:

  public class MyAnalyzer extends Analyzer {
    private Analyzer standard = new StandardAnalyzer();
    public TokenStream tokenStream(String field, final Reader reader) {
      if ("element".equals(field)) {            // don't tokenize
        return new CharTokenizer(reader) {
          protected boolean isTokenChar(char c) { return true; }
        };
      } else {                                  // use standard analyzer
        return standard.tokenStream(field, reader);
      }
    }
  }

  Analyzer analyzer = new MyAnalyzer();
  Query query = queryParser.parse("... +element:POST", analyzer);

Alternately, if this query field is added by a program, then this can be
done by bypassing the analyzer for this clause, building it directly
instead:

  Analyzer analyzer = new StandardAnalyzer();
  BooleanQuery query = (BooleanQuery) queryParser.parse("...", analyzer);

  // now add the element clause
  query.add(new TermQuery(new Term("element", "POST")), true, false);

Perhaps this should become an FAQ...

Doug
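
A possible follow-up for running the combined query (not part of Doug's 
mail; the index path "index" is just a placeholder):

  IndexSearcher searcher = new IndexSearcher("index");
  Hits hits = searcher.search(query);   // "query" as built above
  System.out.println(hits.length() + " matching documents");
  searcher.close();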

