case sensitivity

2007-04-26 Thread Michael Kimsal

I've looked through the mailing lists and can't find much of anything
regarding case sensitivity.  It seems Solr is case sensitive by default -
I'm using the default settings with a very basic schema - just text fields.

Is there any way to tell the query parser to be case insensitive during a
query?  Or do I have to reindex all my data again with lowercase values?



--
Michael Kimsal
http://webdevradio.com


Re: case sensitivity

2007-04-26 Thread Erik Hatcher


On Apr 26, 2007, at 5:43 PM, Michael Kimsal wrote:

> I've looked through the mailing lists and can't find much of anything
> regarding case sensitivity.  It
> seems SOLR is case sensitive by default - I'm using the default
> settings
> with a very basic schema - just text fields.

All depends on the analysis you have set up for the fields.  If
you're indexing "string"-type fields in the default example schema,
there is effectively no analysis, so searches must be exact matches,
case and all.

> Is there any way to tell the query parser to be case insensitive
> during a
> query?  Or do I have to reindex
> all my data again with lowercase values?

Terms are indexed in a case-sensitive manner, so if you need case
insensitivity you need to lowercase on the way in and on querying.
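
The point about lowercasing on both sides can be sketched with a toy inverted index. This is a minimal illustration in Python, not Solr code: `ToyIndex`, `keep_case` and `lowercase_words` are hypothetical names standing in for a field type and its analyzer.

```python
from collections import defaultdict


def keep_case(text):
    """Stand-in for an unanalyzed "string" field: the whole value is one term."""
    return [text]


def lowercase_words(text):
    """Stand-in for a text field with a lowercase step: split on whitespace, lowercase."""
    return [word.lower() for word in text.split()]


class ToyIndex:
    """Hypothetical inverted index; stored terms are compared exactly."""

    def __init__(self, analyzer):
        self.analyzer = analyzer
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in self.analyzer(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # The same analyzer runs on the query, so both sides agree on case.
        hits = set()
        for term in self.analyzer(query):
            hits |= self.postings.get(term, set())
        return hits


string_field = ToyIndex(keep_case)
string_field.add(1, "Fox")
text_field = ToyIndex(lowercase_words)
text_field.add(1, "Fox")

print(string_field.search("fox"))  # set(): exact match required, case and all
print(string_field.search("Fox"))  # {1}
print(text_field.search("FOX"))    # {1}: lowercased on the way in and on querying
```

If only one side were lowercased (say, the query but not the indexed terms), the lookup would compare "fox" against a stored "Fox" and miss, which is why existing data has to be reindexed after changing the index-time analysis.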


Erik




Re: case sensitivity

2007-04-26 Thread Michael Kimsal

I was just writing a followup.

I'm using the default text field type

[fieldType XML not preserved in the archive]


That looks to me like it's got LowerCaseFilterFactory in the query analyzer
and the index analyzer.

I'm still digging in to this, but are there any other things to look for
anyone can point me to?  (Thanks Erik!)




Re: case sensitivity

2007-04-26 Thread Michael Kimsal

type:changelog AND ( ( (listing:Fox) or (listing:Fox*) or (listing:*Fox) ) )
and
type:changelog AND ( ( (listing:fox) or (listing:fox*) or (listing:*fox) ) )

Is this to do with the wildcards?

Actually, I've just answered my own question.

type:changelog AND ( ( (listing:fox) ) )
and
type:changelog AND ( ( (listing:Fox) ) )

give the same results.

But adding in the or listing:fox* or listing:*fox is always case-sensitive.
However,
http://wiki.apache.org/lucene-java/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a
seems to say that wildcard searches are not case-sensitive.

Unless someone can point out a way around this, it seems I'll need to
manually reindex and lower-case everything on the way in, then reformat my
search queries to be lower-case as well.



Re: case sensitivity

2007-04-26 Thread Michael Kimsal

My colleague, after some digging, found in SolrQueryParser

(around line 62)
setLowercaseExpandedTerms(false);

The default for Lucene is true.  Was this intentional?  Or an oversight?

Perhaps it's not related to my problem, but it seems that it might be.

Thanks in advance!


Re: case sensitivity

2007-04-26 Thread Erik Hatcher


On Apr 26, 2007, at 6:03 PM, Michael Kimsal wrote:

> My colleague, after some digging, found in SolrQueryParser
>
> (around line 62)
> setLowercaseExpandedTerms(false);
>
> The default for Lucene is true.  Was this intentional?  Or an
> oversight?

I was just about to respond that this is likely the issue with your
non-totally-lowercased wildcard terms.

I don't consider it an oversight, but rather this whole analysis
business and wildcards are things that vary from project to project
on how they should be handled.  If you have, for example, a string
field and want to do prefix queries on it (trailing asterisk), you
wouldn't want the term to be lowercased.


I think we should open up as many of the switches as we can to  
QueryParser, allowing users to tinker with them if they want, setting  
the defaults to the most common reasonable settings we can agree upon.


Erik



Re: case sensitivity

2007-04-26 Thread Michael Kimsal

We're (and by 'we' I mean my esteemed colleague!) working on patching a few
of these items to be in the solrconfig.xml file and should likely have some
patches submitted next week.  It's being done on 'company time' and I'm not
sure about the exact policy/procedure for this sort of thing here (or
indeed, if there is one at all).


Re: case sensitivity

2007-04-27 Thread Yonik Seeley

On 4/26/07, Michael Kimsal <[EMAIL PROTECTED]> wrote:

> We're (and by 'we' I mean my esteemed colleague!) working on patching a few
> of these items to be in the solrconfig.xml file and should likely have some
> patches submitted next week.  It's being done on 'company time' and I'm not
> sure about the exact policy/procedure for this sort of thing here (or
> indeed, if there is one at all).

That's fine, as long as your company has agreed to contribute back the
patch (under the Apache license).  Apache enjoys a lot of business
support (being business friendly) and a *lot* of contributions are done
on company time.

Anything really big would probably need a CLA, but patches only
require clicking the "grant license to ASF" button in JIRA.

-Yonik


Re: case sensitivity

2007-04-27 Thread Michael Kimsal

Can you point me to the process for submitting these small patches?  I'm
looking at the jira site but don't see much of anything there outlining a
process for submitting patches.  Sorry to be so basic about this, but I'm
trying to follow correct procedures on both sides of the aisle, so to speak.


Re: case sensitivity

2007-04-27 Thread Ryan McKinley

Michael Kimsal wrote:

> Can you point me to the process for submitting these small patches?  I'm
> looking at the jira site but don't see much of anything there outlining a
> process for submitting patches.  Sorry to be so basic about this, but I'm
> trying to follow correct procedures on both sides of the aisle, so to
> speak.




Check: http://wiki.apache.org/solr/HowToContribute

Essentially you will create a new issue on JIRA, then upload a svn diff 
to that issue.


holler if you have any troubles

ryan



Re: case sensitivity

2007-04-27 Thread Michael Kimsal

So I just create my own 'issue' first?  OK.  Thanks.


Re: case sensitivity

2007-04-27 Thread Michael Kimsal

What's the procedure then for something to get included in the next
release?

Thanks again all!


Re: case sensitivity

2007-04-27 Thread Otis Gospodnetic
Once the code/patch in the issue is put/committed to SVN, it means it will be 
in the next release.  You get your patch committed faster if it's clear, well 
written and explained, if it comes with a unit test if it's a code change, and 
so on.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share


Re: case sensitivity

2007-04-27 Thread Yonik Seeley

On 4/26/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:

> I think we should open up as many of the switches as we can to
> QueryParser, allowing users to tinker with them if they want, setting
> the defaults to the most common reasonable settings we can agree upon.


I think we should also try and handle what we can automatically too.
Always lowercasing or not isn't elegant, as the right thing to do
depends on the field.

I always had it in my head that the QueryParser should figure it out.
Actually, for good performance, the fieldType should figure it out just once.
The presence of a LowerCaseFilter could be one signal to lowercase
prefix strings, or one could actually run a test token through analysis
and test if it comes out lowercased.
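
The "run a test token through analysis" idea might look roughly like the following Python sketch. It is an illustration only, not Solr's implementation: `lowercases_terms`, the two analyzer functions and the `LOWERCASE_PREFIXES` table are hypothetical names, and an analyzer is modelled as any callable from text to a list of terms.

```python
def lowercases_terms(analyze, probe="ProbeTOKEN"):
    """Guess whether an analysis chain lowercases by running one token through it.

    `analyze` is any callable mapping text to a list of terms. The probe
    string is arbitrary; it only needs to contain uppercase letters.
    """
    terms = analyze(probe)
    # An empty result (for example, the probe was dropped) proves nothing.
    return bool(terms) and all(term == term.lower() for term in terms)


def text_analyzer(text):
    # Hypothetical chain with a lowercase step.
    return [word.lower() for word in text.split()]


def string_analyzer(text):
    # Hypothetical unanalyzed field: the value is kept verbatim.
    return [text]


# Computed once per field type, then reused for every prefix query.
LOWERCASE_PREFIXES = {
    "text": lowercases_terms(text_analyzer),
    "string": lowercases_terms(string_analyzer),
}


def normalize_prefix(field_type, prefix):
    return prefix.lower() if LOWERCASE_PREFIXES[field_type] else prefix


print(normalize_prefix("text", "Fox"))    # fox
print(normalize_prefix("string", "Fox"))  # Fox
```

Doing the probe once per field type, rather than per query, matches the performance concern raised above.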

Numeric fields are a sticking point... prefix queries and wildcard
queries aren't even possible there.  Of course, even stemming is
problematic with wildcard queries.

-Yonik


Re: case sensitivity

2007-04-27 Thread Yonik Seeley

On 4/26/07, Michael Kimsal <[EMAIL PROTECTED]> wrote:

> My colleague, after some digging, found in SolrQueryParser
>
> (around line 62)
> setLowercaseExpandedTerms(false);
>
> The default for Lucene is true.  Was this intentional?  Or an oversight?


Way back before Solr was open-sourced, and Chris was the only
"user", I thought he needed to do case-sensitive prefix and
wildcard queries (hence I set it to false).  I think I may have been
mistaken about that need, but by that time, I didn't know if anyone
depended on it, so I never changed it back.

A default of false is actually more powerful too.  You can do prefix
queries on fields that have a LowercaseFilter in their analyzer, and
also fields that don't.  If it's set to true, you can't reliably do
prefix queries on fields that don't have a LowercaseFilter.

-Yonik


Re: case sensitivity

2007-04-27 Thread Michael Pelz Sherman
In our experience, setting a LowercaseFilter in the query did not work; we had
to call setLowercaseExpandedTerms(true) to get wildcard queries to be
case-insensitive.

Here's our analyzer definition from our solr schema:

[analyzer XML not preserved in the archive]

If calling setLowercaseExpandedTerms(true) is *not* in fact necessary for
case-insensitive wildcard queries, could you please provide an example of a
solr schema that can achieve this?

Thanks!
- mps
  

Re: case sensitivity

2007-04-27 Thread Yonik Seeley

On 4/27/07, Michael Pelz Sherman <[EMAIL PROTECTED]> wrote:

> In our experience, setting a LowercaseFilter in the query did not work; we had
> to call setLowercaseExpandedTerms(true) to get wildcard queries to be
> case-insensitive.

Correct, because in that case the QueryParser does not invoke analysis
(because it's a partial word, not a whole word).

> If calling setLowercaseExpandedTerms(true) is *not* in fact necessary for
> case-insensitive wildcard queries, could you please provide an example of a
> solr schema that can achieve this?

I didn't say that :-)

I'm saying setLowercaseExpandedTerms(true) is not sufficient for
wildcard queries in general.  If the term is indexed as "Windows95",
then a prefix query of Windows* won't find anything if
setLowercaseExpandedTerms(true) is set.

-Yonik
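
The behaviour discussed in this exchange can be sketched as a toy model in Python. It is not the Lucene or Solr QueryParser: `parse_term`, `prefix_hits` and the `lowercase_expanded_terms` argument are hypothetical stand-ins for the real parser and its setLowercaseExpandedTerms switch, and only trailing-asterisk patterns are handled.

```python
def analyze(term):
    """Stand-in for a field's query analyzer that ends in a lowercase filter."""
    return term.lower()


def parse_term(term, lowercase_expanded_terms=False):
    """Toy version of the behaviour described in the thread.

    Whole words go through analysis. Terms containing * or ? are partial
    words, so analysis is skipped and only the flag decides their case.
    """
    if "*" in term or "?" in term:
        return term.lower() if lowercase_expanded_terms else term
    return analyze(term)


def prefix_hits(indexed_terms, pattern):
    """Match a trailing-asterisk pattern against indexed terms, case-sensitively."""
    prefix = pattern.rstrip("*")
    return [t for t in indexed_terms if t.startswith(prefix)]


lowercased_field = ["fox", "foxes"]  # indexed through a lowercase filter
verbatim_field = ["Windows95"]       # indexed without one

print(parse_term("Fox"))                                          # fox
print(prefix_hits(lowercased_field, parse_term("Fox*")))          # []
print(prefix_hits(lowercased_field, parse_term("Fox*", True)))    # ['fox', 'foxes']
print(prefix_hits(verbatim_field, parse_term("Windows*", True)))  # []
print(prefix_hits(verbatim_field, parse_term("Windows*")))        # ['Windows95']
```

The last two lines are the "Windows95" case: with lowercasing forced on, the prefix becomes "windows" and can no longer match a term that was indexed with its original case.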



TextField case sensitivity

2007-06-07 Thread Xuesong Luo
I ran into a problem when searching on a TextField. When I pass q=William or
q=WILLiam, Solr is able to find records whose default search field value
is William; however, if I pass q=WilliAm, Solr does not return anything.
I searched the archive; Yonik mentioned that the LowerCaseFilterFactory
doesn't work for wildcards because the QueryParser does not invoke
analysis for partial words, which makes sense. But in my case, it's a
whole word. Does anyone know why it's not working? Below is my schema info.

Thanks
Xuesong

[fieldType XML not preserved in the archive]



Re: TextField case sensitivity

2007-06-07 Thread Yonik Seeley

On 6/7/07, Xuesong Luo <[EMAIL PROTECTED]> wrote:

> I ran into a problem when searching on a TextField. When I pass q=William or
> q=WILLiam, Solr is able to find records whose default search field value
> is William; however, if I pass q=WilliAm, Solr does not return anything.


Sounds like WordDelimiterFilter is still being used for your fieldType.
After you changed the fieldType for "text", did you restart Solr and
re-index your collection?

-Yonik



Re: TextField case sensitivity

2007-06-07 Thread Ryan McKinley

Have you taken a look at the output from admin/analysis?
http://localhost:8983/solr/admin/analysis.jsp?highlight=on

This lets you see what tokens are generated for index/query.  From your
description, I suspect that the generated tokens are actually:

 willi am
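
Why "WilliAm" would come out as two tokens can be sketched in Python as below. This is a deliberately simplified stand-in for a word-delimiter step that splits on case changes, not the actual WordDelimiterFilter, which has many more rules and options; `split_on_case_change` and `analyze` are hypothetical names.

```python
import re


def split_on_case_change(token):
    """Split at each lowercase-to-uppercase transition, e.g. "aB" -> "a", "B"."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()


def analyze(text):
    """Hypothetical chain: whitespace tokenizer, case-change split, lowercase."""
    terms = []
    for token in text.split():
        terms.extend(part.lower() for part in split_on_case_change(token))
    return terms


print(analyze("William"))  # ['william']
print(analyze("WILLiam"))  # ['william']
print(analyze("WilliAm"))  # ['willi', 'am']
```

Under these assumptions the indexed value "William" yields the single term "william", while the query "WilliAm" yields "willi" and "am", neither of which is in the index.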

Also, if you want the same analyzer for indexing and query, just define one:

[analyzer XML not preserved in the archive]


RE: TextField case sensitivity

2007-06-07 Thread Xuesong Luo
I have WordDelimiterFilter defined in the schema. I didn't include it in
my original email because I thought it didn't matter. It seems it
does: it looks like WilliAm is treated as two words, which is why it
didn't find a match.

Thanks
Xuesong


RE: TextField case sensitivity

2007-06-07 Thread Xuesong Luo
Ryan, you are right, that's the problem. WilliAm is treated as two words
by the WordDelimiterFilterFactory.

Thanks
Xuesong


Re: TextField case sensitivity

2007-06-07 Thread Mike Klaas


On 7-Jun-07, at 1:04 PM, Xuesong Luo wrote:

> Ryan, you are right, that's the problem. WilliAm is treated as two
> words by the WordDelimiterFilterFactory.

I have found this behaviour a little too aggressive for my needs, so I
added an option to disable it.  Patch is here:

http://issues.apache.org/jira/browse/SOLR-257

I'll probably commit it in a day or so, at which point it will be  
part of the Solr nightly build.


-Mike


Text field case sensitivity problem

2011-06-14 Thread Jamie Johnson
I am using the following for my text field:

[fieldType XML not preserved in the archive]

I have a field defined as

[field XML not preserved in the archive]

When I go to the following URL I get results:
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*
but if I do
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*
I get nothing.  I thought the LowerCaseFilterFactory would have handled
lowercasing both the query and what is being indexed. Am I missing
something?


Re: Text field case sensitivity problem

2011-06-14 Thread Jamie Johnson
Also of interest to me: this returns results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine


On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnson  wrote:

> I am using the following for my text field:
>
>  positionIncrementGap="100" autoGeneratePhraseQueries="true">
>   
> 
> 
> 
>  ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
> />
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> 
>  protected="protwords.txt"/>
> 
>   
>   
> 
>  ignoreCase="true" expand="true"/>
>  ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
> />
>  generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> 
>  protected="protwords.txt"/>
> 
>   
> 
>
> I have a field defined as
>
>
> when I execute a go to the following url I get results
> http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*
> but if I do
> http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*
> I get nothing.  I thought the LowerCaseFilterFactory would have handled
> lowercasing both the query and what is being indexed, am I missing
> something?
>


Re: Text field case sensitivity problem

2011-06-14 Thread Mike Sokolov
Wildcard queries aren't analyzed, I think?  I'm not completely sure what 
the best workaround is here: perhaps simply lowercasing the query terms 
yourself in the application.  Also - I hope someone more knowledgeable 
will say that the new HighlightQuery in trunk doesn't have this 
restriction, but I'm not sure about that.


-Mike


RE: Text field case sensitivity problem

2011-06-14 Thread Bob Sandiford
Unfortunately, wild card search terms don't get processed by the analyzers.

One suggestion that's fairly common is to make sure you lower case your wild 
card search terms yourself before issuing the query.

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com


Re: Text field case sensitivity problem

2011-06-14 Thread Mike Sokolov

oops, please s/Highlight/Wildcard/


Re: Text field case sensitivity problem

2011-06-15 Thread Jamie Johnson
So simply lowercasing the query works, but it can get complex.  The query that I'm
executing may have things like ranges, which require some words to be upper
case (e.g. TO).  I think this would be much better solved on Solr's end; is
there a JIRA about this?
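
For anyone who needs the application-side workaround in the meantime, here is a minimal sketch, assuming queries are plain whitespace-separated tokens with no quoted phrases or escaped characters. It lowercases only the term part of tokens that contain a wildcard, so keywords such as TO and AND are left alone. The function name is made up for illustration and is not part of Solr.

```python
import re

# A whitespace-delimited token that contains a wildcard character (* or ?).
WILDCARD_TOKEN = re.compile(r"\S*[*?]\S*")


def lowercase_wildcard_terms(query: str) -> str:
    """Lowercase the term part of wildcard tokens; leave everything else untouched.

    Assumption: no quoted phrases or backslash escapes appear in the query.
    """
    def fix(match):
        token = match.group(0)
        # Keep an optional "field:" prefix as-is, since field names are case sensitive.
        field, sep, term = token.rpartition(":")
        return field + sep + term.lower()

    return WILDCARD_TOKEN.sub(fix, query)


print(lowercase_wildcard_terms("Person_Name:Kris* AND Age:[10 TO 20]"))
```

Only Kris* is rewritten here; the field name, AND and the range keyword TO pass through unchanged. A real query would need proper parsing rather than a regular expression.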

On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolov  wrote:

> opps, please s/Highlight/Wildcard/
>
>
> On 06/14/2011 05:31 PM, Mike Sokolov wrote:
>
>> Wildcard queries aren't analyzed, I think?  I'm not completely sure what
>> the best workaround is here: perhaps simply lowercasing the query terms
>> yourself in the application.  Also - I hope someone more knowledgeable will
>> say that the new HighlightQuery in trunk doesn't have this restriction, but
>> I'm not sure about that.
>>
>> -Mike



Re: Text field case sensitivity problem

2011-06-15 Thread Mike Sokolov
I wonder whether CharFilters are applied to wildcard terms?  I suspect 
they might be.  If that's the case, you could use the MappingCharFilter 
to perform lowercasing (and strip diacritics too if you want that)


-Mike

On 06/15/2011 10:12 AM, Jamie Johnson wrote:
So simply lower casing the works but can get complex.  The query that 
I'm executing may have things like ranges which require some words to 
be upper case (i.e. TO).  I think this would be much better solved on 
Solrs end, is there a JIRA about this?




Re: Text field case sensitivity problem

2011-06-30 Thread Jamie Johnson
I'm not familiar with the CharFilters, I'll look into those now.

Is the solr.LowerCaseFilterFactory not handling wildcards the expected
result or is this a bug?

On Wed, Jun 15, 2011 at 4:34 PM, Mike Sokolov  wrote:
> I wonder whether CharFilters are applied to wildcard terms?  I suspect they
> might be.  If that's the case, you could use the MappingCharFilter to
> perform lowercasing (and strip diacritics too if you want that)
>
> -Mike


Re: Text field case sensitivity problem

2011-06-30 Thread Jamie Johnson
I think my answer is here...

"On wildcard and fuzzy searches, no text analysis is performed on the
search word. "

taken from http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers


On Thu, Jun 30, 2011 at 10:23 AM, Jamie Johnson  wrote:
> I'm not familiar with the CharFilters, I'll look into those now.
>
> Is the solr.LowerCaseFilterFactory not handling wildcards the expected
> result or is this a bug?


Re: Text field case sensitivity problem

2011-06-30 Thread Mike Sokolov
Yes, after posting that response, I read some more and came to the same 
conclusion... there seems to be some interest on the dev list in 
building a capability to specify an analysis chain for use with wildcard 
and related queries, but it doesn't exist now.


-Mike

On 06/30/2011 10:34 AM, Jamie Johnson wrote:

I think my answer is here...

"On wildcard and fuzzy searches, no text analysis is performed on the
search word. "

taken from http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers




Re: Text field case sensitivity problem

2011-06-30 Thread Erik Hatcher
Jamie - there is a JIRA about this, at least one: 


Erik
 
On Jun 15, 2011, at 10:12 , Jamie Johnson wrote:

> So simply lower casing the works but can get complex.  The query that I'm
> executing may have things like ranges which require some words to be upper
> case (i.e. TO).  I think this would be much better solved on Solrs end, is
> there a JIRA about this?



Re: Text field case sensitivity problem

2011-06-30 Thread Mike Sokolov

Yes, and this too: https://issues.apache.org/jira/browse/SOLR-219

On 06/30/2011 12:46 PM, Erik Hatcher wrote:

Jamie - there is a JIRA about this, at least 
one:

Erik



eDisMax and Boolean operator case-sensitivity

2013-11-06 Thread Tom Mortimer
Hi,

I'm using eDisMax query parser, and need to support Boolean operators AND
and OR. It seems from testing that these are *not* case sensitive, e.g.
setting mm to 0, "oscar AND wilde" returns the same results as "oscar and
wilde" (15 hits) while "oscar foo wilde" returns the same results as "oscar
wilde" (2000 hits).

Is it possible to configure eDisMax to do case-sensitive parsing, so that
"AND" is an operator but "and" is just another term?

thanks,
Tom


DataImportHandler - case sensitivity of column names

2010-02-08 Thread Alexey Serba
I encountered the problem with Oracle converting column names to upper
case. As a result SolrInputDocument is created with field names in
upper case and "Document [null] missing required field: id" exception
is thrown ( although ID field is defined ).

I do not specify "field" elements explicitly.

I know that I can rewrite all my queries to "select id as "id", body
as "body" from document" format, but is there any other workaround for
this? case insensitive option or something?

Here's my data-config:

  
  

  
  
  

  


Alexey


solr field types and case sensitivity

2007-12-18 Thread Dryganets Sergey

Can I change the query analyzer for a specific request to Solr?
That is, I want to add an option on my site to choose whether a search is
case-sensitive or not, but I can't find a good solution ...

I think that creating duplicates (index-only fields with different analyzer
configurations) for each field is a bad idea ...

Maybe someone knows a good solution for this problem?

-- 
View this message in context: 
http://www.nabble.com/solr-field-types-and-case-sensitivity-tp14395912p14395912.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: eDisMax and Boolean operator case-sensitivity

2013-11-06 Thread Shawn Heisey

On 11/6/2013 11:46 AM, Tom Mortimer wrote:

I'm using eDisMax query parser, and need to support Boolean operators AND
and OR. It seems from testing that these are *not* case sensitive, e.g.
setting mm to 0, "oscar AND wilde" returns the same results as "oscar and
wilde" (15 hits) while "oscar foo wilde" returns the same results as "oscar
wilde" (2000 hits).

Is it possible to configure eDisMax to do case-sensitive parsing, so that
"AND" is an operator but "and" is just another term?


Include another query parameter: lowercaseOperators=false

http://wiki.apache.org/solr/ExtendedDisMax#lowercaseOperators

Thanks,
Shawn
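
A minimal sketch of what such a request could look like when built from a client, assuming a Solr instance at http://localhost:8983/solr/select (a placeholder URL, not something from Tom's setup). The helper name is made up for illustration; only the defType, q and lowercaseOperators parameters come from this thread and the linked wiki page.

```python
from urllib.parse import urlencode


def edismax_url(base_url: str, user_query: str) -> str:
    """Build a select URL that asks eDisMax to treat only upper-case AND/OR as operators."""
    params = {
        "defType": "edismax",
        "q": user_query,
        # With this set to false, lower-case "and"/"or" are ordinary terms.
        "lowercaseOperators": "false",
    }
    return base_url + "?" + urlencode(params)


# Placeholder host and path; adjust to the actual core or collection.
print(edismax_url("http://localhost:8983/solr/select", "oscar and wilde"))
```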



Re: eDisMax and Boolean operator case-sensitivity

2013-11-06 Thread Tom Mortimer
Oh, good grief - I was just reading that page, how did I miss that? *derp*

Thanks Shawn!!!

Tom


On 6 November 2013 18:59, Shawn Heisey  wrote:

> On 11/6/2013 11:46 AM, Tom Mortimer wrote:
>
>> I'm using eDisMax query parser, and need to support Boolean operators AND
>> and OR. It seems from testing that these are *not* case sensitive, e.g.
>>
>> setting mm to 0, "oscar AND wilde" returns the same results as "oscar and
>> wilde" (15 hits) while "oscar foo wilde" returns the same results as
>> "oscar
>> wilde" (2000 hits).
>>
>> Is it possible to configure eDisMax to do case-sensitive parsing, so that
>> "AND" is an operator but "and" is just another term?
>>
>
> Include another query parameter: lowercaseOperators=false
>
> http://wiki.apache.org/solr/ExtendedDisMax#lowercaseOperators
>
> Thanks,
> Shawn
>
>


documentation deficiency : case sensitivity of boolean operators

2009-09-15 Thread Jonathan Vanasco

I couldn't find this anywhere on solr's docs / faq

i finally found a reference on lucene
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html

this should really be added somewhere.  i'm not sure where, but I  
thought this was worth bringing up to the list -- as it really  
confused the hell out of me :)


Re: DataImportHandler - case sensitivity of column names

2010-02-08 Thread Shalin Shekhar Mangar
On Mon, Feb 8, 2010 at 3:59 PM, Alexey Serba  wrote:

> I encountered the problem with Oracle converting column names to upper
> case. As a result SolrInputDocument is created with field names in
> upper case and "Document [null] missing required field: id" exception
> is thrown ( although ID field is defined ).
>
> I do not specify "field" elements explicitly.
>
> I know that I can rewrite all my queries to "select id as "id", body
> as "body" from document" format, but is there any other workaround for
> this? case insensitive option or something?
>
> Here's my data-config:
> 
>   driver="oracle.jdbc.driver.OracleDriver" password="oracle"
> url="jdbc:oracle:thin:@localhost:1521:xe" user="SYSTEM"/>
>  
> query="select id, body from document"
> transformer="TemplateTransformer">
>  
>  
>  
>
>  
> 
>
>
Fields are imported in a case-insensitive manner as long as they are not
specified explicitly. In this case, however, the problem is that the
${root.id} is case sensitive. There is no way right now to resolve variables
in a case-insensitive manner.

-- 
Regards,
Shalin Shekhar Mangar.


Re: solr field types and case sensitivity

2007-12-18 Thread Ryan McKinley

Dryganets Sergey wrote:

can I change query analyzer for concrete request to solr?
ie: I want add option on my site use case-sensitive search or not for this
search request, but can't find any good solution ...

I think that create duplicates (index only fields with different analyzers
configuration) for each field it's bad idea ...



yes, you would index a field twice - once with a LowerCaseFilter and 
once without.  That is a good solution.


ryan
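
On the client side, the two-field approach could be used roughly as sketched below. This is only an illustration: it assumes the two copies of a field are named with cs_ (case-sensitive) and ncs_ (case-insensitive) prefixes, a naming that appears later in this thread, and the helper function is hypothetical.

```python
from urllib.parse import urlencode


def search_params(field: str, term: str, case_sensitive: bool) -> str:
    """Pick the field copy that matches the user's case-sensitivity choice.

    Assumption: the schema has two copies of each field, cs_<field> analyzed
    without a LowerCaseFilter and ncs_<field> analyzed with one.
    """
    prefix = "cs_" if case_sensitive else "ncs_"
    return urlencode({"q": f"{prefix}{field}:{term}"})


print(search_params("post_text", "Apple", case_sensitive=True))
print(search_params("post_text", "Apple", case_sensitive=False))
```

The query term is passed through unchanged in both cases, because the analyzer configured for the ncs_ copy does the lowercasing at query time.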


Re: solr field types and case sensitivity

2007-12-18 Thread Dryganets Sergey



ryantxu wrote:
> 
> yes, you would index a field twice - once with a LowerCaseFilter and 
> once without.  That is a good solution.
> 

Hm...
So I would have to create n*n indexes, where n is the number of search options ...

Can I copy fields automatically?

For example, I have a field with name  and a subset of fields with
prefixes or suffixes, so
can I use a regexp to copy the field?

Or maybe I can describe a "copy field policy" for a fieldType (for me this
solution would be better: it takes less effort to add a new search option)

-- 
View this message in context: 
http://www.nabble.com/solr-field-types-and-case-sensitivity-tp14395912p14411420.html



Re: solr field types and case sensitivity

2008-01-11 Thread Dryganets Sergey



ryantxu wrote:
> 
> yes, you would index a field twice - once with a LowerCaseFilter and 
> once without.  That is a good solution.
> 

I implemented this and now there are some strange problems:

1. Highlighting does not always work properly for case-insensitive search






   
 

You can find the full schema at:
https://issues.apache.org/jira/browse/SOLR-444

I know a bad solution to this problem: store the data for both the cs and ncs
fields
(but I don't want to store another 200 MB for each search option)

It seems to me that Solr currently has too many problems with highlighting :(

As I understand it, to highlight properly the Lucene highlighter needs an
Analyzer, but the Analyzer is linked to the field type.
So I think that is the root of this problem ...

Maybe someone knows a good solution?
I.e. one that solves this problem with just configuration changes

2.
Add the following document:


 3
 Apples





fl=*,score&q=ncs_post_text:apple&start=0&rows=10&hl=true&hl.fl=post_text
returns the document with id=3, but does not return any highlighting info
(appl, apple and apples return the same result)

fl=*,score&q=cs_post_text:Apple&start=0&rows=10&hl=true&hl.fl=post_text
returns nothing, and a request for Appl returns nothing either;

only the word Apples returns a result

I don't understand it ... I only removed the lower case filter factory









-- 
View this message in context: 
http://www.nabble.com/solr-field-types-and-case-sensitivity-tp14395912p14751690.html



Case sensitivity on hostnames and email addresses

2006-12-13 Thread Wade Leftwich
I've run into some unexpected case sensitivity on searches, at least
unexpected by me.

If you index a text field containing this sentence:

A sentence containing CamelCase words by [EMAIL PROTECTED] is found
at StudlyCaps.org

The document will be found by searching for "camelcase" but not for
"[EMAIL PROTECTED]" or "studlycaps.org".

This happens with the Standard or the DisMax query handler.

A bit of a problem for me, because I'm indexing a bunch of business
magazines, and domain names are frequently capitalized, often in CamelCase.

Is this maybe a bug? Or a WAD?

-- Wade Leftwich
Ithaca, NY



why does "*" affect case sensitivity of query results

2013-04-29 Thread geeky2
hello,

environment: solr 3.5


problem statement: when query has "*" appended, it turns case sensitive.

assumption: query should NOT be case sensitive

actual value in database at time of index: 4387828BULK

here is a snapshot of what works and does not work.

what works:

  itemModelNoExactMatchStr:4387828bULk (and any variation of upper and lower
case letters for *bulk*)

  itemModelNoExactMatchStr:4387828bu*
  itemModelNoExactMatchStr:4387828bul*
  itemModelNoExactMatchStr:4387828bulk*


what does NOT work:

 itemModelNoExactMatchStr:4387828BU*
 itemModelNoExactMatchStr:4387828BUL*
 itemModelNoExactMatchStr:4387828BULK*


below are the specifics of my field and fieldType

  



  



  
  


  


thx
mark





--
View this message in context: 
http://lucene.472066.n3.nabble.com/why-does-affect-case-sensitivity-of-query-results-tp4059801.html


Solr Case-sensitivity issue with search field name

2013-02-28 Thread hyrax
Hi guys,

I'm using Solr 4.0 and I recently noticed an issue that bothers me a lot:
if you define a field in your schema named 'HOST', then in the
query you have to specify this field as 'HOST', while if you use 'host' it
throws an 'undefined field' error.

I have done some googling, but I only found a JIRA ticket which says this
issue had been fixed: https://issues.apache.org/jira/browse/SOLR-873

I know I can use  to accomplish this, but I'm wondering if there is a
way to apply this change to all the fields on the fly, not one by one ...

Many many thanks in advance!
Thanks,
Hyrax



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Case-sensitivity-issue-with-search-field-name-tp4043800.html


How to ignore whitespace/ case sensitivity with dedupe

2011-05-28 Thread tinman
Hi all,

I've followed the instructions at this link
http://wiki.apache.org/solr/Deduplication and got the basic dedupe field
working. However, it doesn't seem to recognize case differences or white
space differences even though I've defined the type of the fields to be
used for dedupe, as well as the signature field, as follows in schema.xml


  


  
  


  




and in the solrconfig.xml 

  true
  false
  signatureField
  name
  org.apache.solr.update.processor.Lookup3Signature



  

I know a possible solution is to lowercase and remove white spaces for the
field "name" before submitting documents to solr, but are there any other
alternatives so that when the following data is given
Name: JOHN SMITH and jOhn  SMITh the documents have the same outcome in
signatureField?

Thanks heaps
Cheers
tinman







--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-ignore-whitespace-case-sensitivity-with-dedupe-tp2997624p2997624.html


Re: documentation deficiency : case sensitivity of boolean operators

2009-09-15 Thread Chris Hostetter

: Subject: documentation deficiency : case sensitivity of boolean operators
: 
: I couldn't find this anywhere on solr's docs / faq

if you have suggestions on places to add it, feel free to update the wiki.

(most of the documentation is deliberately agnostic to the specifics of the 
query parser syntax, instead relying on links to point you to the same 
reference URL you found ... so i can't actually think of anywhere in the 
Solr docs that mentions the AND/OR/NOT syntax that it would make sense to 
clarify this)

-Hoss



Re: documentation deficiency : case sensitivity of boolean operators

2009-09-15 Thread Yonik Seeley
That's already linked from
http://wiki.apache.org/solr/SolrQuerySyntax

-Yonik
http://www.lucidimagination.com


On Tue, Sep 15, 2009 at 5:38 PM, Jonathan Vanasco  wrote:
> I couldn't find this anywhere on solr's docs / faq
>
> i finally found a reference on lucene
>        http://lucene.apache.org/java/2_4_0/queryparsersyntax.html
>
> this should really be added somewhere.  i'm not sure where, but I thought
> this was worth bringing up to the list -- as it really confused the hell out
> of me :)
>


Re: Case sensitivity on hostnames and email addresses

2006-12-13 Thread Otis Gospodnetic
When indexing (and searching), make sure you are using an Analyzer that 
lower-cases (or upper-cases) tokens.
These are from Lucene, so Solr has them, too:
  ./src/java/org/apache/lucene/analysis/LowerCaseTokenizer.java
  ./src/java/org/apache/lucene/analysis/LowerCaseFilter.java

Otis

- Original Message 
From: Wade Leftwich <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, December 13, 2006 11:32:11 PM
Subject: Case sensitivity on hostnames and email addresses

I've run into some unexpected case sensitivity on searches, at least
unexpected by me.

If you index a text field containing this sentence:

A sentence containing CamelCase words by [EMAIL PROTECTED] is found
at StudlyCaps.org

The document will be found by searching for "camelcase" but not for
"[EMAIL PROTECTED]" or "studlycaps.org".

This happens with the Standard or the DisMax query handler.

A bit of a problem for me, because I'm indexing a bunch of business
magazines, and domain names are frequently capitalized, often in CamelCase.

Is this maybe a bug? Or a WAD?

-- Wade Leftwich
Ithaca, NY






Re: Case sensitivity on hostnames and email addresses

2006-12-13 Thread Walter Underwood
Also, avoid stemming URLs. I used a stemmer that turned my
"best.com" URL into "good.com". The Lucene StandardAnalyzer
works pretty hard to avoid that. --wunder

On 12/13/06 9:33 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

> When indexing (and searching), make sure you are using an Analyzer that
> lower-cases (or upper-cases) tokens.
> These are from Lucene, so Solr has them, too:
>   ./src/java/org/apache/lucene/analysis/LowerCaseTokenizer.java
>   ./src/java/org/apache/lucene/analysis/LowerCaseFilter.java
> 
> Otis
> 
> - Original Message 
> From: Wade Leftwich <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, December 13, 2006 11:32:11 PM
> Subject: Case sensitivity on hostnames and email addresses
> 
> I've run into some unexpected case sensitivity on searches, at least
> unexpected by me.
> 
> If you index a text field containing this sentence:
> 
> A sentence containing CamelCase words by [EMAIL PROTECTED] is found
> at StudlyCaps.org
> 
> The document will be found by searching for "camelcase" but not for
> "[EMAIL PROTECTED]" or "studlycaps.org".
> 
> This happens with the Standard or the DisMax query handler.
> 
> A bit of a problem for me, because I'm indexing a bunch of business
> magazines, and domain names are frequently capitalized, often in CamelCase.
> 
> Is this maybe a bug? Or a WAD?
> 
> -- Wade Leftwich
> Ithaca, NY
> 
> 
> 
> 



Re: Case sensitivity on hostnames and email addresses

2006-12-13 Thread Yonik Seeley

On 12/13/06, Wade Leftwich <[EMAIL PROTECTED]> wrote:

I've run into some unexpected case sensitivity on searches, at least
unexpected by me.

If you index a text field containing this sentence:

A sentence containing CamelCase words by [EMAIL PROTECTED] is found
at StudlyCaps.org

The document will be found by searching for "camelcase" but not for
"[EMAIL PROTECTED]" or "studlycaps.org".

This happens with the Standard or the DisMax query handler.

A bit of a problem for me, because I'm indexing a bunch of business
magazines, and domain names are frequently capitalized, often in CamelCase.


It's your text analysis configuration.
The WordDelimiterFilter is doing this... it's so "CamelCase" can be
found searching for "camelcase", "camel-case" or "camel case".
It does this by detecting all the word parts and then indexing them
separately as well as all catenated.  So "CamelCase" is indexed as
both "camelcase" and "camel case".
When searching, the WordDelimiterFilter is configured to split only,
so "camelcase", "camel-case", and "camel case" will all match.

When it hits something like [EMAIL PROTECTED], it would index it as
"upanddownmysitecom" and "up and down mysite com"
On the search side, a search of "[EMAIL PROTECTED]" is broken into
"upanddown mysite com" which doesn't match anything indexed.

There are a number of options, not limited to
- create a new fieldtype and throw out the WordDelimiterFilter... the
  current "text" field type is for demonstration purposes only anyway.
  Solr, like Lucene, is meant to be customized.
- If you want to keep the camel-case flexibility, but not across "."
  and "-", then try using a letter tokenizer to throw away the
  non-letter characters first.
- create a specific filter for email or website addresses if no
  combination of existing filters does what you want.

Play around with the analysis tool on the admin page, it will help you
understand what's going on.
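
The index-side versus query-side behaviour described above can be sketched roughly in Python. This is only a toy model under stated assumptions, not Solr's WordDelimiterFilter code: the helper names are made up, token positions are ignored, and the address UpAndDown@mysite.com is a hypothetical stand-in because the archive redacted the real one.

```python
import re

def split_parts(word):
    """Split one word on non-alphanumerics and on case changes, then lowercase."""
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", word):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", chunk))
    return [p.lower() for p in parts]

def index_tokens(text):
    """Index side: keep every part and also the catenation of all parts."""
    tokens = set()
    for word in text.split():
        parts = split_parts(word)
        tokens.update(parts)
        if len(parts) > 1:
            tokens.add("".join(parts))
    return tokens

def query_tokens(text):
    """Query side: split only, no catenation."""
    return [p for word in text.split() for p in split_parts(word)]

def matches(query, indexed):
    # Simplification: a real phrase query would also check token positions.
    return all(tok in indexed for tok in query_tokens(query))

doc = ("A sentence containing CamelCase words by UpAndDown@mysite.com "
       "is found at StudlyCaps.org")
indexed = index_tokens(doc)

print(matches("camelcase", indexed))             # True
print(matches("camel-case", indexed))            # True
print(matches("upanddown@mysite.com", indexed))  # False: "upanddown" was never indexed
print(matches("studlycaps.org", indexed))        # False: "studlycaps" was never indexed
```

The two failing queries fail for the same reason: the query side produces a partly joined token ("upanddown", "studlycaps") that the index side never emitted, since it only stored the individual parts and the full catenation.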

-Yonik


Re: Case sensitivity on hostnames and email addresses

2006-12-13 Thread Yonik Seeley

Oh, and yet another way to get around it (with its own trade-offs) is
to use something like fieldtype textTight in the example schema.xml,
which catenates all word parts in both the index analyzer and query
analyzer.

This would index as "upanddownmysitecom" and allow the following
queries to match:
"[EMAIL PROTECTED]", "[EMAIL PROTECTED]/com", "[EMAIL PROTECTED]"

The downside is that it would *not* allow "upanddown" or "UpAndDown" to match.
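
As a rough illustration of the catenate-everything approach, here is a small Python model. It is an assumption-laden sketch, not the textTight analyzer itself: the `catenate` helper and the sample address are hypothetical, since the archive redacted the original examples.

```python
import re

def catenate(word):
    """Drop non-alphanumerics, join what is left, and lowercase it."""
    return "".join(re.findall(r"[A-Za-z0-9]+", word)).lower()

# The same function runs at index time and at query time.
indexed = {catenate("UpAndDown@mysite.com")}

def matches(query):
    return catenate(query) in indexed

print(matches("upanddown@mysite.com"))    # True
print(matches("up-and-down@mysite/com"))  # True
print(matches("upanddown"))               # False: only the full catenation is indexed
```

Because both sides collapse to a single token, punctuation and case stop mattering, but so does any chance of matching on a fragment of the address.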

-Yonik




Re: why does "*" affect case sensitivity of query results

2013-04-29 Thread Alexandre Rafalovitch
http://wiki.apache.org/solr/MultitermQueryAnalysis

Sorry, not for your version of Solr.

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Mon, Apr 29, 2013 at 11:40 AM, geeky2  wrote:
> hello,
>
> environment: solr 3.5
>
>
> problem statement: when query has "*" appended, it turns case sensitive.
>
> assumption: query should NOT be case sensitive
>
> actual value in database at time of index: 4387828BULK
>
> here is a snapshot of what works and does not work.
>
> what works:
>
>   itemModelNoExactMatchStr:4387828bULk (and any variation of upper and lower
> case letters for *bulk*)
>
>   itemModelNoExactMatchStr:4387828bu*
>   itemModelNoExactMatchStr:4387828bul*
>   itemModelNoExactMatchStr:4387828bulk*
>
>
> what does NOT work:
>
>  itemModelNoExactMatchStr:4387828BU*
>  itemModelNoExactMatchStr:4387828BUL*
>  itemModelNoExactMatchStr:4387828BULK*
>
>
> below are the specifics of my field and fieldType
>
>stored="true"/>
>
>
>  positionIncrementGap="100">
>   
> 
> 
> 
>   
>   
> 
> 
>   
> 
>
> thx
> mark
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/why-does-affect-case-sensitivity-of-query-results-tp4059801.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: why does "*" affect case sensitivity of query results

2013-04-29 Thread geeky2
was looking in Smiley's book on page 129 and 130.

from the book,

>>
No text analysis is performed on the search word containing the wildcard,
not even lowercasing. So if you want to find a word starting with Sma, then
sma* is required instead of Sma*, assuming the index side of the field's type
includes lowercasing. This shortcoming is tracked on SOLR-219. Moreover,
if the field that you want to use the wildcard query on is stemmed in the
analysis, then smashing* would not find the original text Smashing because
the stemming process transforms this to smash. Consequently, don't stem.
<<

thx
mark
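
The behaviour quoted from the book can be modelled in a few lines of Python. This is a sketch under assumptions, not Solr code: `analyze` stands in for an analysis chain that only lowercases, and `prefix_query` is a made-up helper. It also shows the obvious client-side workaround for 3.5, which is to lowercase the prefix before sending the query.

```python
def analyze(term):
    """Stand-in for an analysis chain that only lowercases."""
    return term.lower()

# Index side: the stored value goes through analysis.
indexed_terms = {analyze("4387828BULK")}

def prefix_query(raw, analyze_prefix=False):
    """Match a trailing-* query against the indexed terms.

    analyze_prefix=False models the quoted behaviour: the wildcard term
    is compared exactly as typed, with no lowercasing.
    """
    prefix = raw.rstrip("*")
    if analyze_prefix:
        prefix = analyze(prefix)
    return any(term.startswith(prefix) for term in indexed_terms)

print(prefix_query("4387828bu*"))                       # True
print(prefix_query("4387828BU*"))                       # False
print(prefix_query("4387828BU*", analyze_prefix=True))  # True
```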




--
View this message in context: 
http://lucene.472066.n3.nabble.com/why-does-affect-case-sensitivity-of-query-results-tp4059801p4059812.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: why does "*" affect case sensitivity of query results

2013-04-29 Thread geeky2
here is the jira link:

https://issues.apache.org/jira/browse/SOLR-219





--
View this message in context: 
http://lucene.472066.n3.nabble.com/why-does-affect-case-sensitivity-of-query-results-tp4059801p4059814.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: why does "*" affect case sensitivity of query results

2013-04-30 Thread Erick Erickson
Actually, look at the referenced JIRA
https://issues.apache.org/jira/browse/SOLR-2438 and you'll see it's
changed in 3.6.

Best
Erick

On Mon, Apr 29, 2013 at 9:36 AM, geeky2  wrote:
> here is the jira link:
>
> https://issues.apache.org/jira/browse/SOLR-219
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/why-does-affect-case-sensitivity-of-query-results-tp4059801p4059814.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: why does "*" affect case sensitivity of query results

2013-04-30 Thread geeky2
hello erick,

thank you for the info - yes - i did notice ;)

one more reason for us to upgrade from 3.5.

thx
mark




--
View this message in context: 
http://lucene.472066.n3.nabble.com/why-does-affect-case-sensitivity-of-query-results-tp4059801p406.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Case-sensitivity issue with search field name

2013-02-28 Thread Shawn Heisey

On 2/28/2013 3:40 PM, hyrax wrote:

I'm using Solr 4.0 and I recently noticed an issue that bothers me a lot:
if you define a field in your schema named 'HOST', then in the query you
have to refer to this field as 'HOST', while using 'host' throws an
'undefined field' error.

I have done some googling but only found a jira ticket which says this
issue had been fixed:  https://issues.apache.org/jira/browse/SOLR-873


I know I can use  to accomplish this but I'm wondering if there is a
way to apply this change to all the fields on the fly, not one by one ...


It appears that the issue you have linked is specific to the dataimport 
handler (importing from a database or another structured data source), 
not searching.  I've always read that fields in a Solr schema are case 
sensitive.


My own recommendation is that you pick a standard, either all uppercase 
or all lowercase, and that you stick with it.  I prefer all lowercase 
myself.
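
If the schema cannot be renamed, one possible workaround is to normalize field names in the client before the query reaches Solr. The Python below is a naive sketch of that idea, not a Solr feature; `canonical_fields` and the sample field names are hypothetical.

```python
import re

def canonical_fields(query, schema_fields):
    """Rewrite field prefixes in a query to the spelling used in the schema."""
    by_lower = {name.lower(): name for name in schema_fields}

    def fix(match):
        name = match.group(1)
        # Unknown names are left untouched.
        return by_lower.get(name.lower(), name) + ":"

    # Naive: treats any word followed by ':' as a field name, so it would
    # also touch colons inside quoted values.
    return re.sub(r"\b(\w+):", fix, query)

print(canonical_fields("host:example.com AND Title:solr", ["HOST", "TITLE"]))
# HOST:example.com AND TITLE:solr
```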


Thanks,
Shawn



Re: Solr Case-sensitivity issue with search field name

2013-02-28 Thread Walter Underwood
Lower case is safer than upper case. For Unicode, uppercasing is a lossy 
conversion. There are sets of different lower case characters that convert to 
the same upper case character. When you convert back to lower case, you don't 
know which one it was originally.

Always use lower case for text. That avoids some really subtle bugs.
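
A quick way to see the lossy conversion is with Python's built-in Unicode case mapping. The two characters below are just examples picked for illustration; `round_trip` is a made-up helper name.

```python
def round_trip(ch):
    """Uppercase a character, then lowercase the result."""
    return ch.upper().lower()

# Two distinct lowercase letters that share an uppercase form with a plain letter.
long_s, dotless_i = "\u017f", "\u0131"   # long s and dotless i

print(long_s.upper() == "s".upper())       # True: both become 'S'
print(dotless_i.upper() == "i".upper())    # True: both become 'I'
print(round_trip(long_s) == long_s)        # False: comes back as plain 's'
print(round_trip(dotless_i) == dotless_i)  # False: comes back as plain 'i'
```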

wunder

On Feb 28, 2013, at 3:47 PM, Shawn Heisey wrote:

> On 2/28/2013 3:40 PM, hyrax wrote:
>> I'm using Solr 4.0 and I recently notice an issue that bothers me a lot
>> which is that if you define a field in your schema named 'HOST' then in the
>> query you have to specify this field by 'HOST' while if you used 'host' it
>> would throw an 'undefined field' error.
>> 
>> I have done some googling while I only found a jira ticket which says this
>> issue had been fixed:  https://issues.apache.org/jira/browse/SOLR-873
>> 
>> 
>> I know I can use  to accomplish this but I'm wonder if there a
>> way to apply this change all the field on the fly not one by one ...
> 
> It appears that the issue you have linked is specific to the dataimport 
> handler (importing from a database or another structured data source), not 
> searching.  I've always read that fields in a Solr schema are case sensitive.
> 
> My own recommendation is that you pick a standard, either all uppercase or 
> all lowercase, and that you stick with it.  I prefer all lowercase myself.
> 
> Thanks,
> Shawn
> 






Re: Solr Case-sensitivity issue with search field name

2013-03-01 Thread hyrax
Hi Shawn,
Thanks for your reply.
So you mean the field name can't be case insensitive when specified in a
query?
I'm gonna stop doing research on this issue if this is confirmed...
Thanks,
Hyrax



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Case-sensitivity-issue-with-search-field-name-tp4043800p4044006.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Case-sensitivity issue with search field name

2013-03-01 Thread hyrax
Hi wunder,
Great advice!
As a matter of fact, I chose to use upper case because of the documents I
indexed, but it is really a pain in the ass typing the field names all in
upper case.
I thought there would probably be a way to make field names case-insensitive.
I was wrong, wasn't I?
Thanks,
Hyrax



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Case-sensitivity-issue-with-search-field-name-tp4043800p4044010.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to ignore whitespace/ case sensitivity with dedupe

2011-05-28 Thread Koji Sekiguchi

(11/05/29 8:47), tinman wrote:

Hi all,

I've followed the instructions at this link
http://wiki.apache.org/solr/Deduplication and got the basic dedupe field
working. However, it doesn't seem to recognize case differences or white
space differences even though I've defined the type of the fields to be
used for dedupe as well as the signature field as follows in schema.xml


   
 
 
   
   
 
 
   
 



and in the solrconfig.xml
 
   true
   false
   signatureField
   name
   org.apache.solr.update.processor.Lookup3Signature
 
 
 
   

I know a possible solution is to lowercase and remove white spaces for the
field "name" before submitting documents to solr, but are there any other
alternatives so that when the following data is given
Name: JOHN SMITH and jOhn  SMITh the documents have the same outcome in
signatureField?


I can't believe this. Those signatures should be different.

Are you sure you see the same signatures in signatureField (it should be
stored=true in order to see the result of the signature)? Or did you just see
that those duplicate documents were registered, without checking
signatureField yourself? If the latter, it is a feature, because you set
overwriteDupes=false, which means the duplication check works on the
uniqueKey field.

koji
--
http://www.rondhuit.com/en/


Re: How to ignore whitespace/ case sensitivity with dedupe

2011-05-28 Thread tinman
By default, stored = true, indexed = true. In any case, this is an example
output from solr search console.


  
1234
JOHN   SMITH 
5430fbe9e6374611
  
1233
   john SMITh
49867a7835ff6741


As you can see, the 2 signature fields are different. And I want
overwriteDupes = false as I want to use field collapsing to remove
duplicates at query time.
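
For reference, the client-side normalization mentioned earlier in the thread (lowercase and strip whitespace before submitting) could look roughly like this in Python. It is a sketch under assumptions: MD5 is only a stand-in for whatever signature class Solr is configured with, and `normalized_signature` is a made-up helper, not part of Solr.

```python
import hashlib

def normalized_signature(name):
    """Lowercase, strip all whitespace, then hash.

    The hash choice is arbitrary here; the point is the normalization
    step that runs before hashing.
    """
    key = "".join(name.lower().split())
    return hashlib.md5(key.encode("utf-8")).hexdigest()

a = normalized_signature("JOHN   SMITH ")
b = normalized_signature("   john SMITh")
print(a == b)  # True: both names normalize to "johnsmith"
```

The normalized key could be sent in its own field and used as the signature source, leaving the original "name" value untouched for display.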

Thanks
tinman


--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-ignore-whitespace-case-sensitivity-with-dedupe-tp2997624p2997738.html
Sent from the Solr - User mailing list archive at Nabble.com.


Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-10 Thread David Neubert
Hi all,

Using SOLR, I believe I have to index the same content 4 times (not desirable) 
into 2 indexes -- and I don't know how you can practically do multiple indexes 
in SOLR (if indeed there is no better solution than 4 indexing runs into two 
indexes).

My need is case-sensitive and case insensitive searches over well formed XML 
content (books), performing exact searches at the paragraph and sentence levels 
-- no errors over approximate boundaries -- the source content has exact 
par/sen tags.

I have already proven a pretty nice solution for par/sen indexing twice into 
the same index in SOLR.  I have added a tags field, and put correlative XML 
tags (comma delimited) into this field (one of which is either a para or sen 
flag) which flags the document (partial) as a paragraph or sentence.  Thus all 
paragraphs of the book are indexed as single documents (with their sentences 
combined and concatenated) and then all sentences in the book are indexed again 
as single documents.  Both go into the same SOLR index. I just add an AND 
"tags:para" or "tags:sen" to my search and everything works fine.

The obvious downside to this approach is the 2X indexing, but it does execute 
quite nicely on a single Index using SOLR. This obviously doesn't scale nicely, 
but will do for quite a while probably.

I thought I could live with that

But then I moved on to case sensitive and case-insensitive searches, and my 
research so far is pointing to one index for each case.

So now I have:
(1) 4X in content indexing
(2) 2X in actual SOLR/Lucene indices
(3) I don't know how to practically do multiple indices using SOLR?

If there is a better way of attacking this problem, I would appreciate 
recommendations!!!

Also, I don't know how to do multiple indices in SOLR -- I have heard it might 
be available in 1.3.0.  If this is my only recourse, please advise me where 
really good documentation is available on building 1.3.0.  I am not admin 
savvy, but I did succeed in getting SOLR up myself and navigating through it 
with the help of this forum.  But I have heard that building 1.3.0 (as opposed 
to downloading and installing it, like in 1.2.0) is a whole different 
experience and much more complex.

Thanks

Dave





__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-10 Thread Ryan McKinley



So now I have:
(1) 4X in content indexing
(2) 2X in actual SOLR/Lucene indices
(3) I don't know how to practically due multiple indices using SOLR?

If there is a better way of attacking this problem, I would appreciate 
recommendations!!!



I don't quite follow your current approach, but it sounds like you just 
need some copyFields to index the same content with multiple analyzers.


for example, say you have fields:

 
 stored="false"/>
 stored="false"/>

 

and copy fields:

  
  
  


The 4X indexing cost?  If you *need* to index the content 4 different 
ways, you don't have any way around that - do you?  But is it really a 
big deal?  How often does it need to index?  How big is the data?
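
The copyField idea (one source text, several fields, each with its own analysis) can be modelled in plain Python. This is only an illustration under assumptions, not Solr configuration: the field names `text_exact` and `text_lower` and the two toy analyzers are hypothetical.

```python
def analyze_exact(text):
    """Case-sensitive analysis: whitespace split only."""
    return text.split()

def analyze_lower(text):
    """Case-insensitive analysis: whitespace split, then lowercase."""
    return [tok.lower() for tok in text.split()]

# Hypothetical field names; each one is paired with its own analyzer.
ANALYZERS = {"text_exact": analyze_exact, "text_lower": analyze_lower}

def index_document(text):
    """Copy one source text into every field, analyzed per field."""
    return {field: set(analyze(text)) for field, analyze in ANALYZERS.items()}

def search(doc, field, query):
    """Analyze the query with the chosen field's analyzer, then match."""
    return all(tok in doc[field] for tok in ANALYZERS[field](query))

doc = index_document("Call me Ishmael")
print(search(doc, "text_exact", "Ishmael"))  # True
print(search(doc, "text_exact", "ishmael"))  # False
print(search(doc, "text_lower", "ISHMAEL"))  # True
```

The content is submitted once; choosing case-sensitive or case-insensitive search is then just a matter of which field the query targets.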


I'm not quite following your need for multiple solr indices, but in 1.3 
it is possible.


ryan


Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-10 Thread David Neubert
Ryan,

Thanks for your response.  I infer from your response that you can have a 
different analyzer for each field -- I guess I should have figured that out 
--but because I had not thought of that, I concluded that  I needed multiple 
indices (sorry , I am still very new to Solr/Lucene).  

Does such an approach make querying difficult under the following condition?

The app that I am replacing (and trying to enhance) has the ability to search 
multiple books at once with sen/par and case sensitivity settings individually 
selectable per book (e.g. default search modes per book).  So with a single 
query request (just the query word(s)), you can search one book by par, with 
case, another by sen w/o case, etc. -- all settable as user defaults.  I need 
to try to figure out how to match that in Solr/Lucene -- I believe that the 
Analyzer approach you suggested requires the use of the same Analyzer at query 
time that was used during indexing.   So if I am hitting multiple fields (in 
the same search request) that invoke different Analyzers -- am I at a dead end, 
and have to resort to consecutive multiple queries instead (and sort merge 
results afterwards)?  Or am I just over complicating this?

Dave






Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-10 Thread Yonik Seeley
On Nov 10, 2007 4:24 PM, David Neubert <[EMAIL PROTECTED]> wrote:
> So if I am hitting multiple fields (in the same search request) that invoke 
> different Analyzers -- am I at a dead end, and have to result to consequetive 
> multiple queries instead

Solr handles that for you automatically.

> The app that I am replacing (and trying to enhance) has the ability to search 
> multiple books at once
> with sen/par and case sensitivity settings individually selectable per book

You could easily select case sensitivity or not *per query* across all books.
You should step back and see what the requirements actually are (i.e.
the reasons why one needs to be able to select case
sensitive/insensitive on a book level... it doesn't make sense to me
at first blush).

It could be done on a per-book level in solr with a more complex query
structure though...

(+case:sensitive +(normal relevancy query on the case sensitive fields
goes here)) OR (+case:insensitive +(normal relevancy query on the case
insensitive fields goes here))
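
A client could assemble that two-branch query with a small helper like the one below. This is a hedged sketch: it assumes each document carries a `case` field holding its book's setting, the field names `text_exact` and `text_lower` are hypothetical, and no escaping of the user's input is attempted.

```python
def per_book_query(user_query, exact_field="text_exact", lower_field="text_lower"):
    """Build the two-branch query string; field names are hypothetical."""
    sensitive = f"(+case:sensitive +({exact_field}:({user_query})))"
    insensitive = f"(+case:insensitive +({lower_field}:({user_query})))"
    return f"{sensitive} OR {insensitive}"

print(per_book_query("whale"))
# (+case:sensitive +(text_exact:(whale))) OR (+case:insensitive +(text_lower:(whale)))
```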

-Yonik


Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-10 Thread Ryan McKinley

David Neubert wrote:

Ryan,

Thanks for your response.  I infer from your response that you can have a 
different analyzer for each field


yes!  each field can have its own indexing strategy.


I believe that the Analyzer approach you suggested requires the use 
of the same Analzyer at query time that was used during indexing.  


it does not require the *same* Analyzer - it just requires one that 
generates compatible tokens.  That is, you may want the indexing to 
split the input into sentences, but the query time analyzer keeps the 
input as a single token.


check the example schema.xml file -- the 'text' field type does not apply 
synonyms at index time, but does at query time.


re searching across multiple fields, don't worry, lucene handles this 
well.  You may want to do that explicitly or with the dismax handler.


I'd suggest you play around with indexing some data.  check the 
analysis.jsp in the admin section.  It is a great tool to help figure 
out what analyzers do at index vs query time.


ryan



Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-10 Thread David Neubert
Ryan (and others who need something to put them to sleep :) )

Wow -- the light-bulb finally went off -- the Analyzer admin page is very cool 
-- I just was not at all thinking the SOLR/Lucene way.

I need to rethink my whole approach now that I understand (from reviewing the 
schema.xml closer and playing with the Analyzer) how compatible index and query 
policies can be applied automatically on a field by field basis by SOLR at both 
index and query time.

I still may have a stumper here, but I need to give it some thought, and may 
return again with another question:

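The problem is that my text is book text (fairly large) that looks very much 
like one would expect:


...
...
..


The search results need to return exact sentences or paragraphs with their 
exact page:line numbers (which is available in the embedded markup in the 
text).

There were previous responses by others, suggesting I look into payloads, but 
I did not fully understand that -- I may have to re-read those e-mails now 
that I am getting a clearer picture of SOLR/Lucene.

However, the reason I resorted to indexing each paragraph as a single 
document, and then redundantly indexing each sentence as a single document, is 
because I was planning on pre-parsing the text myself (outside of SOLR) -- and 
feeding separate  elements to the  because in that way I could produce the 
page:line reference in the pre-parsing (again outside of SOLR) and feed it in 
as explicit field in the  elements of the  requests.  Therefore at 
query time, I will have the exact page:line corresponding to the start of the 
paragraph or sentence.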

But I am beginning to suspect, I was planning to do a lot of work that SOLR can 
do for me.

I will continue to study this and respond when I am a bit clearer, but the 
closer I could get to just submitting the books a chapter at a time -- and 
letting SOLR do the work, the better (cause I have all the books in well formed 
xml at chapter levels).  However, I don't  see yet how I could get par/sen 
granular search result hits, along with their exact page:line coordinates 
unless I approach it by explicitly indexing the pars and sens as single 
documents, not chapter hits, and also return the entire text of the sen or 
par, and highlight the keywords within (for the search result hit).  Once a 
search result hit is selected, it would then act as expected and position into 
the chapter, at the selected reference, highlight again the key words, but this 
time in the context of an entire chapter (the whole document to the user's 
mind).

Even with my new understanding you (and others) have given me, which I can use 
to certainly improve my approach -- it still seems to me that because 
multi-valued fields concatenate text -- even if you use the positionIncrementGap 
feature to prohibit unwanted phrase matches, how do you produce a well defined 
search result hit, bounded by the exact sen or par, unless you index them as 
single documents?

Should I still read up on the payload discussion?

Dave










Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-10 Thread David Neubert
Yonik (or anyone else)

Do you know where on-line documentation on the +case: syntax is located?  I 
can't seem to find it.

Dave







Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-11 Thread Erik Hatcher
Solr query syntax is documented here:
<http://wiki.apache.org/solr/SolrQuerySyntax>


What Yonik is referring to is creating your own "case" field with the  
per-book setting attached at index time.


Erik






Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-12 Thread Erick Erickson
DISCLAIMER: This is from a Lucene-centric viewpoint. That said, this may be
useful.

For your line number, page number etc perspective, it is possible to index
special guaranteed-to-not-match tokens then use the termdocs/termenum
data, along with SpanQueries to figure this out at search time. For
instance, coincident with the last term in each line, index the token "$".
Coincident with the last token of every paragraph index the token "#". If
you get the offsets of the matching terms, you can quite quickly simply
count the number of line and paragraph tokens using TermDocs/TermEnums and
correlate hits to lines and paragraphs. The trick is to index your special
tokens with an increment of 0 (see SynonymAnalyzer in Lucene In Action for
more on this).
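
The marker-token idea can be illustrated without any Lucene API at all. The Python below is a toy model under assumptions: it does not use TermDocs, TermEnum or SpanQuery, the helper names are invented, and only the "$" line marker is modelled.

```python
from bisect import bisect_left

def index_lines(lines):
    """Return (terms, marker_positions).

    Each term gets the next position. A line-end marker ("$" in the
    message above) is recorded at the same position as the last term of
    its line, which is what a position increment of 0 achieves.
    """
    terms, markers, pos = [], [], -1
    for line in lines:
        for word in line.split():
            pos += 1
            terms.append((pos, word.lower()))
        markers.append(pos)
    return terms, markers

def line_of(position, markers):
    """1-based line number: count the line-end markers before this position."""
    return bisect_left(markers, position) + 1

terms, markers = index_lines(["Call me Ishmael", "Some years ago"])
hits = [pos for pos, term in terms if term == "years"]
print([line_of(pos, markers) for pos in hits])  # [2]
```

A paragraph marker would work the same way with a second list of positions.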


Another possibility is to add a special field to each document holding the
offsets of each end-of-sentence and end-of-paragraph (stored, not indexed).
Again, "given the offsets", you can read in this field and figure out what
line/paragraph your hits are in.
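
Mapping a hit offset to its sentence with a stored list of end offsets is a binary search. A minimal Python sketch, assuming character offsets and treating every full stop as a sentence end (both are simplifications, and `unit_index` is a made-up name):

```python
from bisect import bisect_left

def unit_index(hit_offset, end_offsets):
    """0-based index of the sentence (or paragraph) containing hit_offset.

    end_offsets is the sorted list of character offsets at which each
    sentence or paragraph ends (inclusive).
    """
    return bisect_left(end_offsets, hit_offset)

text = "Call me Ishmael. Some years ago."
# What would be kept in the stored, non-indexed field.
sentence_ends = [i for i, ch in enumerate(text) if ch == "."]

print(unit_index(text.index("Ishmael"), sentence_ends))  # 0
print(unit_index(text.index("years"), sentence_ends))    # 1
```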

How suitable either of these is depends on a lot of characteristics of your
particular problem space. I'm not sure either of them is suitable for very
high volume applications.

Also, I'm approaching this from an in-the-guts-of-lucene perspective, so
don't even *think* of asking me how to really make this work in SOLR.

Best
Erick

On Nov 11, 2007 12:44 AM, David Neubert <[EMAIL PROTECTED]> wrote:

> Ryan (and others who need something to put them so sleep :) )
>
> Wow -- the light-bulb finally went off -- the Analzyer admin page is very
> cool -- I just was not at all thinking the SOLR/Lucene way.
>
> I need to rethink my whole approach now that I understand (from reviewing
> the schema.xml closer and playing with the Analyser) how compatible index
> and query policies can be applied automatically on a field by field basis by
> SOLR at both index and query time.
>
> I still may have a stumper here, but I need to give it some thought, and
> may return again with another question:
>
> The problem is that my text is book text (fairly large) that ooks very
> much like one would expect:
> 
> 
> ...
> ...
> ..
> 
> 
> The search results need to return exact sentences or paragraphs with their
> exact page:line numbers (which is available in the embedded markup in the
> text).
>
> There were previous responses by others, suggesting I look into payloads,
> but I did not fully understand that -- I may have to re-read those e-mails
> now that I am getting a clearer picture of SOLR/Lucene.
>
> However, the reason I resorted to indexing each paragraph as a single
> document, and then redundantly indexing each sentence as a single document,
> is because I was planning on pre-parsing the text myself (outside of SOLR)
> -- and feeding separate <doc> elements to the <add> because in that way I
> could produce the page:line reference in the pre-parsing (again outside of
> SOLR) and feed it in as an explicit field in the <doc> elements of the <add>
> requests.  Therefore at query time, I will have the exact page:line
> corresponding to the start of the paragraph or sentence.
>
> But I am beginning to suspect, I was planning to do a lot of work that
> SOLR can do for me.
>
> I will continue to study this and respond when I am a bit clearer, but the
> closer I could get to just submitting the books a chapter at a time -- and
> letting SOLR do the work, the better (cause I have all the books in well
> formed xml at chapter levels).  However, I don't see yet how I could get
> par/sen granular search result hits, along with their exact page:line
> coordinates unless I approach it by explicitly indexing the pars and sens as
> single documents, not chapter hits, and also return the entire text of the
> sen or par, and highlight the keywords within (for the search result hit).
>  Once a search result hit is selected, it would then act as expected and
> position into the chapter, at the selected reference, highlight again the
> key words, but this time in the context of an entire chapter (the whole
> document to the user's mind).
>
> Even with my new understanding you (and others) have given me, which I can
> use to certainly improve my approach -- it still seems to me that because
> multi-valued fields concatenate text -- even if you use the
> positionIncrementGap feature to prohibit unwanted phrase matches, how do you
> produce a well-defined search result hit, bounded by the exact sen or par,
> unless you index them as single documents?
>
> Should I still read up on the payload discussion?
>
> Dave
>
>
>
>
> - Original Message 
> From: Ryan McKinley <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Saturday, November

Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-12 Thread David Neubert
Erik,

Probably because of my newness to SOLR/Lucene, I see now what you/Yonik meant
by "case" field, but I am not clear about your wording "per-book setting
attached at index time" - would you mind elaborating on that, so I am clear?

Dave

- Original Message 
From: Erik Hatcher <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Sunday, November 11, 2007 5:21:45 AM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case 
sensitivity)


Solr query syntax is documented here:
<http://wiki.apache.org/solr/SolrQuerySyntax>

What Yonik is referring to is creating your own "case" field with the  
per-book setting attached at index time.
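
To make that concrete, here is a hedged sketch in Python of what the indexed
documents and the combined query string might look like. Only the "case"
field and the overall query shape come from this thread; the field names
text_exact and text_lower, the sample documents, and the helper function are
hypothetical, and query escaping is left out.

```python
# Each document carries a "case" field whose value is decided per book
# when the book is indexed (hypothetical field names and values).
docs = [
    {"id": "bookA-p1", "case": "sensitive", "text_exact": "Call me Ishmael."},
    {"id": "bookB-p1", "case": "insensitive", "text_lower": "Call me Ishmael."},
]


def per_book_case_query(user_query,
                        exact_field="text_exact",
                        folded_field="text_lower"):
    """Build one query that honors each book's own case setting.

    Documents tagged case:sensitive are matched against a field analyzed
    without lowercasing; documents tagged case:insensitive are matched
    against a field whose analyzer lowercases at index and query time.
    No escaping of special query characters is attempted here.
    """
    sensitive = f"(+case:sensitive +({exact_field}:({user_query})))"
    insensitive = f"(+case:insensitive +({folded_field}:({user_query})))"
    return f"{sensitive} OR {insensitive}"


print(per_book_case_query("Ahab"))
```

The point of the extra field is that the choice is made once per book at
index time, so a single request can mix both behaviors.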

Erik


On Nov 11, 2007, at 12:55 AM, David Neubert wrote:

> Yonik (or anyone else)
>
> Do you know where on-line documentation on the +case: syntax is  
> located?  I can't seem to find it.
>
> Dave
>
> - Original Message 
> From: Yonik Seeley <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Saturday, November 10, 2007 4:56:40 PM
> Subject: Re: Redundant indexing * 4 only solution (for par/sen and  
> case sensitivity)
>
>
> On Nov 10, 2007 4:24 PM, David Neubert <[EMAIL PROTECTED]> wrote:
>> So if I am hitting multiple fields (in the same search request) that
>> invoke different Analyzers -- am I at a dead end, and have to resort to
>> consecutive multiple queries instead
>
> Solr handles that for you automatically.
>
>> The app that I am replacing (and trying to enhance) has the ability
>> to search multiple books at once
>> with sen/par and case sensitivity settings individually selectable
>> per book
>
> You could easily select case sensitivity or not *per query* across all
> books.
> You should step back and see what the requirements actually are (i.e.
> the reasons why one needs to be able to select case
> sensitive/insensitive on a book level... it doesn't make sense to me
> at first blush).
>
> It could be done on a per-book level in solr with a more complex query
> structure though...
>
> (+case:sensitive +(normal relevancy query on the case sensitive fields
> goes here)) OR (+case:insensitive +(normal relevancy query on the case
> insensitive fields goes here))
>
> -Yonik







Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-12 Thread David Neubert
Erik - thanks, I am considering this approach, versus explicit redundant
indexing -- and am also considering Lucene -- problem is, I am one week into
both technologies (though have years in the search space) -- wish I could go to
Hong Kong -- any discounts available anywhere :)

Dave

- Original Message 
From: Erick Erickson <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 12, 2007 2:11:14 PM
Subject: Re: Redundant indexing * 4 only solution (for par/sen and case 
sensitivity)

DISCLAIMER: This is from a Lucene-centric viewpoint. That said, this may be
useful

From the perspective of your line numbers, page numbers, etc., it is possible
to index special guaranteed-to-not-match tokens, then use the TermDocs/TermEnum
data, along with SpanQueries, to figure this out at search time. For instance,
coincident with the last term in each line, index the token "$". Coincident
with the last token of every paragraph, index the token "#". If you get the
offsets of the matching terms, you can quite quickly count the number
of line and paragraph tokens using TermDocs/TermEnum and correlate hits
to lines and paragraphs. The trick is to index your special tokens with a
position increment of 0 (see SynonymAnalyzer in Lucene in Action for more on
this).


Another possibility is to add a special field to each document containing the
offsets of each end-of-sentence and end-of-paragraph (stored, not indexed).
Again, given the offsets of your hits, you can read in this field and figure
out what line/paragraph your hits are in.

How suitable either of these is depends on a lot of characteristics of your
particular problem space. I'm not sure either of them is suitable for very
high-volume applications.

Also, I'm approaching this from an in-the-guts-of-Lucene perspective, so don't
even *think* of asking me how to really make this work in SOLR.

Best
Erick


Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-12 Thread Yonik Seeley
On Nov 12, 2007 2:20 PM, David Neubert <[EMAIL PROTECTED]> wrote:
> Erik - thanks, I am considering this approach, versus explicit redundant
> indexing -- and am also considering Lucene -

There's not a well defined solution in either IMO.

> - problem is, I am one week into both technologies (though have years in the 
> search space) -- wish I could
> go to Hong Kong -- any discounts available anywhere :)

Unfortunately the OS Summit has been canceled.

-Yonik


Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)

2007-11-12 Thread Chris Hostetter

: > - problem is, I am one week into both technologies (though have years in
: > the search space) -- wish I could
: > go to Hong Kong -- any discounts available anywhere :)
: 
: Unfortunately the OS Summit has been canceled.

Or rescheduled to 2008 ... depending on whether you are a half-empty /
half-full kind of person.

And let's not forget Atlanta ... starting today and all...

http://us.apachecon.com/us2007/



-Hoss