OK, I'll open an issue to do this renaming for 3.0, which actually means we do the renaming in 2.4 or 2.9 (deprecating the old names) and then remove the old ones in 3.0.

Mike

On Aug 27, 2008, at 11:08 AM, Otis Gospodnetic wrote:

Nah, I think the names are fine, I simply forgot. I looked at the javadocs; they clearly say NO_NORMS doesn't get passed through an Analyzer. Maybe in 3.0 we can switch to NOT_ANALYZED, as suggested, to reflect reality more closely.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
From: Michael McCandless <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, August 27, 2008 5:36:46 AM
Subject: Re: Case Sensitivity


Actually, as confusing as it is, Field.Index.NO_NORMS means
Field.Index.UN_TOKENIZED plus field.setOmitNorms(true).
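
In code (just a sketch against the 2.x Field API; the field name and value are made up), the two fields below should end up indexed the same way:

    import org.apache.lucene.document.Field;

    // NO_NORMS in one step...
    Field a = new Field("email", "Foo@Example.com",
                        Field.Store.YES, Field.Index.NO_NORMS);

    // ...is the same as UN_TOKENIZED with norms switched off by hand.
    Field b = new Field("email", "Foo@Example.com",
                        Field.Store.YES, Field.Index.UN_TOKENIZED);
    b.setOmitNorms(true);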

Probably we should rename it to Field.Index.UN_TOKENIZED_NO_NORMS?

Mike

Otis Gospodnetic wrote:

Dino, you lost me half-way through your email :(

NO_NORMS does not mean the field is not tokenized.
UN_TOKENIZED does mean the field is not tokenized.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
From: Dino Korah
To: java-user@lucene.apache.org
Sent: Tuesday, August 26, 2008 9:17:49 AM
Subject: RE: Case Sensitivity

I think I should rephrase my question.

[ Context: using the out-of-the-box StandardAnalyzer for both indexing and searching. ]

Is it right to say that a field, if it is either UN_TOKENIZED or NO_NORMS-ized ( field.setOmitNorms(true) ), doesn't get analyzed while indexing? Which means that when we search, the query text still goes through the analyzer, and we need to handle those fields differently in the analyzer we use for searching? Doesn't it also mean that a setOmitNorms(true) field doesn't get tokenized?

What is the best solution if one were to add a set of fields UN_TOKENIZED and others TOKENIZED, with a few of the latter set using setOmitNorms(true) (the index writer is plain StandardAnalyzer based)? A per-field analyzer at query time?!

Many thanks,
Dino


-----Original Message-----
From: Dino Korah [mailto:[EMAIL PROTECTED]
Sent: 26 August 2008 12:12
To: 'java-user@lucene.apache.org'
Subject: RE: Case Sensitivity

A little more case sensitivity questions.

Based on the discussion on http://markmail.org/message/q7dqr4r7o6t6dgo5 and on this thread, is it right to say that a field, if it is either UN_TOKENIZED or NO_NORMS-ized, doesn't get analyzed while indexing? Which means we need to case-normalize (down-case) those fields beforehand?

Does it mean that, if I can afford it, I should use norms?

Many thanks,
Dino



-----Original Message-----
From: Steven A Rowe [mailto:[EMAIL PROTECTED]
Sent: 19 August 2008 17:43
To: java-user@lucene.apache.org
Subject: RE: Case Sensitivity

Hi Dino,

I think you'd benefit from reading some FAQ answers, like:

"Why is it important to use the same analyzer type during indexing
and
search?"

4472d10961ba63c>

Also, have a look at the AnalysisParalysis wiki page for some hints:


On 08/19/2008 at 8:57 AM, Dino Korah wrote:
From the discussion here, what I could understand was: if I am using StandardAnalyzer on TOKENIZED fields, for both indexing and querying, I shouldn't have any problems with cases.

If by "shouldn't have problems with cases" you mean "can match
case-insensitively", then this is true.

But if I have any UN_TOKENIZED fields, there will be problems if I do not case-normalize them myself before adding them as a field to the document.

Again, assuming that by "case-normalize" you mean "downcase", and that you want case-insensitive matching, and that you use the StandardAnalyzer (or some other downcasing analyzer) at query-time, then this is true.

In my case I have a mixed scenario. I am indexing emails, and the email addresses are indexed UN_TOKENIZED. I do have a second set of custom-tokenized fields, which keep the tokens in individual fields with the same name.
[...]
Does it mean that wherever I use UN_TOKENIZED, they do not get through the StandardAnalyzer before getting indexed, but they do when they are searched on?

This is true.

If that is the case, do I need to normalise them before adding to the document?

If you want case-insensitive matching, then yes, you do need to normalize them before adding them to the document.
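
For example (just a sketch, assuming the 2.x Field API; the field name and address are made up):

    import java.util.Locale;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Down-case the raw value yourself, because UN_TOKENIZED values
    // bypass the analyzer (and its LowerCaseFilter) at index time.
    String rawAddress = "Dino.Korah@Example.COM";
    Document doc = new Document();
    doc.add(new Field("email",
                      rawAddress.toLowerCase(Locale.ENGLISH),
                      Field.Store.YES,
                      Field.Index.UN_TOKENIZED));

    // At query time, down-case the user's term the same way before
    // building a TermQuery on that field.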

I also would like to know if it is better to employ an EmailAnalyzer that makes a TokenStream out of the given email address, rather than using a simplistic function that gives me a list of string pieces and adding them one by one. With searches, would both approaches give the same result?

Yes, both approaches give the same result. When you add string pieces one-by-one, you are adding multiple same-named fields. By contrast, the EmailAnalyzer approach would add a single field, and would allow you to control positions (via Token.setPositionIncrement(): ml#setPositionIncrement(int)>), e.g. to improve phrase handling.
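
As a rough illustration only (written against the older TokenStream API where next() returns a Token; the filter name and increment value are made up), such an analyzer could stretch the gap between the pieces of an address like this:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Made-up filter: give every token after the first a larger position
    // increment, so phrase/slop queries see the pieces as further apart.
    public class GapFilter extends TokenFilter {
      private final int gap;
      private boolean first = true;

      public GapFilter(TokenStream in, int gap) {
        super(in);
        this.gap = gap;
      }

      public Token next() throws IOException {
        Token t = input.next();
        if (t != null && !first) {
          t.setPositionIncrement(gap);
        }
        first = false;
        return t;
      }
    }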
Also, if you make up an EmailAnalyzer, you can use it to search against your tokenized email field, along with other analyzer(s) on other field(s), using the PerFieldAnalyzerWrapper AnalyzerWrapper.html>.
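
Wiring that up might look roughly like this (a sketch only; the field names are made up, "EmailAnalyzer" stands for whatever analyzer you write, and KeywordAnalyzer is just one possible choice for the UN_TOKENIZED field):

    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;

    // StandardAnalyzer for the ordinary TOKENIZED fields, something else
    // for the special ones. KeywordAnalyzer passes the query text through
    // as a single un-tokenized term, which lines up with an UN_TOKENIZED
    // field as long as both sides are down-cased consistently.
    PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzer.addAnalyzer("emailExact", new KeywordAnalyzer());
    analyzer.addAnalyzer("emailTokens", new EmailAnalyzer()); // your analyzer

    QueryParser parser = new QueryParser("body", analyzer);
    // parser.parse("emailExact:someone@example.com") would then analyze
    // the emailExact clause with KeywordAnalyzer and everything else with
    // StandardAnalyzer.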

Steve
