I've got the customised Analyzer and Tokenizer working, but it seems I'm back 
at square one, maybe even further back, because now it looks like the search 
is case-sensitive.

My Analyzer:

import java.io.Reader;

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class HyphenKeywordAnalyzer extends KeywordAnalyzer {
  private static final Logger LOGGER =
      LoggerFactory.getLogger(HyphenKeywordAnalyzer.class);

  @Override
  public TokenStream tokenStream(String field, final Reader reader) {
    LOGGER.info("Custom Analyzer [" + field + "], ["
        + ((reader != null) ? reader.toString() : "") + "]");

    // Emit the whole field value as a single token; the LowerCaseFilter is
    // deliberately left out for now, just to see what happens.
    TokenStream keywordTokenStream = new HyphenKeywordTokenizer(reader);
    return keywordTokenStream;
    //return (new LowerCaseFilter(keywordTokenStream));
  }
}

My HyphenKeywordTokenizer class is practically a direct copy of 
KeywordTokenizer, where it emits the entire input as a single token.  As you 
can see above, I'm not using the lower case filter, just to see what happens.
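
For reference, a minimal sketch of what that tokenizer looks like, essentially mirroring KeywordTokenizer on the Lucene 2.x Token API (the code below is an illustrative sketch rather than my exact class):

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;

public class HyphenKeywordTokenizer extends Tokenizer {

  private boolean done = false;

  public HyphenKeywordTokenizer(Reader input) {
    super(input);
  }

  public Token next(final Token reusableToken) throws IOException {
    if (done) {
      return null;
    }
    done = true;
    reusableToken.clear();
    // Read the whole input into the token's term buffer, growing it as needed,
    // so the entire value (hyphens, spaces and all) becomes a single token.
    int upto = 0;
    char[] buffer = reusableToken.termBuffer();
    while (true) {
      int length = input.read(buffer, upto, buffer.length - upto);
      if (length == -1) {
        break;
      }
      upto += length;
      if (upto == buffer.length) {
        buffer = reusableToken.resizeTermBuffer(1 + buffer.length);
      }
    }
    reusableToken.setTermLength(upto);
    return reusableToken;
  }

  public void reset(Reader input) throws IOException {
    super.reset(input);
    this.done = false;
  }
}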

Once again, I have a user named 'Sophie-Anne' 'Roberts' and a user named 'Bob' 
'Arlington-Smythe'.

A search for 'Sophie-Anne' produces the user's record; however, a search for 
'sophie-anne' returns nothing, and neither does 'Sophie-A' nor, now, even 
'Sophie' or 'Sophie*'. Should I be using double quotes in the query now? From 
what H. Wilson has found, it doesn't look like that will solve the problem.

The query being used is:
//*...@sling:resourceType="sakai/user-profile" and (jcr:contains(., 
'Sophie\-Anne') or jcr:contains(*/*/*,'Sophie\-Anne'))] order by @jcr:score 
descending]
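
For reference, the double-quoted (phrase) form I was asking about would presumably look something like this inside the contains clause, with the hyphen still escaped as above (just an illustration of the syntax; I haven't confirmed it helps):

   jcr:contains(., '"Sophie\-Anne"')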


Chris Dunstall | Service Support - Applications
Technology Integration/OLE Virtual Team
Division of Information Technology | Charles Sturt University | Bathurst, NSW

Ph: 02 63384818 | Fax: 02 63384181


-----Original Message-----
From: H. Wilson [mailto:[email protected]] 
Sent: Wednesday, 1 September 2010 6:47 AM
To: [email protected]
Subject: Re: Problems with hyphen in JSR-170 XPath query using jcr:contains


On 08/31/2010 03:05 AM, Ard Schrijvers wrote:
>
>> Given the following parameters in the repository:
>>
>>    .North.South.East.WestLand
>>    .North.South.East.West_Land
>>    .North.South.East.West Land    //yes that's a space
>>
>> The following exact name, case sensitive queries worked as expected for each
>> of the three parameters:
>>
>>    filter.orJCRExpression ("jcr:like(@" + srchField
>> +",'"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");  //case sens.
> jcr:like does not depend on any analyser but on the stored field, so
> this is not strange that it still works.
I expected this too; I just try to be as thorough as possible when 
posting anywhere. I am disappointed enough that I haven't figured this out on 
my own.
>> The following exact name query, case insensitive, worked for only the
>> parameter with a fullName with a whitespace character:
>>
>>    filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>
>> The following exact name queries, case insensitive, stopped working for the
>> fullnames WITHOUT a whitespace character:
>>
>>    filter.addContains ( srchField,
>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>
>> Again, the only change I made was to the analyzer; I didn't remove my
>> "workaround" yet, and I just wanted to confirm I had properly changed the
>> analyzer to figure out how the tokens were working. Oh, I should note: the
>> output from the Analyzer only showed one Token per field, which I believe is
>> what we were looking for. That leaves me as perplexed as before.
>>
>> LowerCaseKeywordAnalyzer.java:
>>
>>    ...
>>
>>    public TokenStream tokenStream ( String field, final Reader reader  ) {
>>             System.out.println ("TOKEN STREAM for field: " + field);
>>             TokenStream keywordTokenStream = super.tokenStream (field,
>> reader);
>>
>>         //changed for testing
>>             TokenStream lowerCaseStream =  new LowerCaseFilter (
>> keywordTokenStream ) ;
>>             final Token reusableToken = new Token();
>>             try {
>>                 Token mytoken = lowerCaseStream.next (reusableToken);
>>                 while ( mytoken != null  ) {
>>                     System.out.println ("[" + mytoken.term() + "]");
>>                     mytoken = lowerCaseStream.next (mytoken);
>>                 }
>>                 //lowerCaseStream.reset();  //uncommenting this did not
>> change results.
>>             }
>>             catch  (IOException ioe) {
>>                 System.err.println ("ERROR: " + ioe.toString());
>>             }
>>
> It's a stream!! So, your keywordTokenStream is now empty. Call reset()
> on the keywordTokenStream before using it again.
>
> Regards Ard
>
>>             return (new LowerCaseFilter ( keywordTokenStream ) );
>>         }
>>
>>    ...
I was really excited when I saw your email this morning. However, 
resetting keywordTokenStream as the last line in the "try" resulted in 
no change. I also tried uncommenting the lowerCaseStream.reset line in 
an act of desperation, with no difference. I must be missing something 
completely obvious at this point... look at a problem too long and the 
obvious fails to jump out at you...
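
For reference, one variant I may try next is to buffer the raw value before printing it, so the debugging output cannot drain the stream that gets returned. A minimal sketch, assuming the same Lucene 2.x API as the code above plus java.io.StringReader:

public TokenStream tokenStream ( String field, final Reader reader ) {
    String value = "";
    try {
        // Read the raw input once into a String; a Reader (and any TokenStream
        // built on top of it) can only be consumed a single time.
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ( (n = reader.read (buf)) != -1 ) {
            sb.append (buf, 0, n);
        }
        value = sb.toString();
        System.out.println ("TOKEN STREAM for field: " + field + " [" + value + "]");
    }
    catch ( IOException ioe ) {
        System.err.println ("ERROR: " + ioe.toString());
    }
    // Build the chain that is actually returned on a fresh Reader, so the
    // debugging above has not emptied it.
    TokenStream keywordTokenStream = super.tokenStream (field, new StringReader (value));
    return ( new LowerCaseFilter ( keywordTokenStream ) );
}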

H. Wilson
>> Thanks.
>>
>> H. Wilson
>>
>> On 08/30/2010 09:38 AM, Ard Schrijvers wrote:
>>> On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson<[email protected]>    wrote:
>>>>   Ard,
>>>>
>>>> You are absolutely right... and this didn't make sense to me either. I
>>>> think I was too worn out from my week and too excited to have code that
>>>> "worked" to notice the obvious... this must be a workaround. However, I
>>>> will need a little guidance on how to inspect the tokens. I have Luke,
>>>> but never really understood how to use it properly. Could you give me a
>>>> clear list of steps, or point me to a resource I missed, on how I would
>>>> go about inspecting tokens during insert/search? Thanks.
>>> I'd just print them to your console with Token#term() or use a
>>> debugger. If you do that during indexing and searching, I think you
>>> must see some difference in the token that explains *why* Lucene
>>> doesn't find a hit for your use case with spaces.
>>>
>>> Luke is hard to use for the multi-index Jackrabbit indexing, as well
>>> as the field value prefixing: it is unfortunate and not completely
>>> necessary any more, but has some historical reasons from Lucene back in
>>> the days when it could not handle very many unique fieldnames.
>>> Regards Ard
>>>
>>>> H. Wilson
>>>>
>>>> On 08/30/2010 03:30 AM, Ard Schrijvers wrote:
>>>>> Hello,
>>>>>
>>>>> On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson<[email protected]>
>>>>>   wrote:
>>>>>>   OK, well I got the spaces part figured out, and will post it for anyone
>>>>>> who needs it. Putting quotes around the spaces unfortunately did not work.
>>>>>>   During testing, I determined that if you performed the following query
>>>>>> for the exact fullName property:
>>>>>>
>>>>>>     filter.addContains ( @fullName,
>>>>>> '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West Land"));
>>>>>>
>>>>>> It would return nothing. But tweak it a little and add a wildcard, and it
>>>>>> would return results:
>>>>>>
>>>>>>    filter.addContains ( @fullName,
>>>>>>    '"+Text.escapeIllegalXpathSearchChars(".North.South.East.West
>>>>>> Lan*"));
>>>>> This does not make sense...see below
>>>>>
>>>>>> But since I did not want to throw in wild cards where they might not be
>>>>>> wanted, I used fn:lower-case whenever a search string contained spaces,
>>>>>> did not contain wild cards, and the user was not concerned with case
>>>>>> sensitivity. So I ended up with the following excerpt (our clients wanted
>>>>>> options for case-sensitive and case-insensitive searching).
>>>>>>
>>>>>> public OurParameter[] getOurParameters ( boolean performCaseSensitiveSearch,
>>>>>>         String searchTerm, String srchField ) {
>>>>>>     //srchField in this case was fullName
>>>>>>
>>>>>>    .....
>>>>>>
>>>>>>    if ( performCaseSensitiveSearch) {
>>>>>>
>>>>>>        //jcr:like for case sensitive
>>>>>>        filter.orJCRExpression ("jcr:like(@" + srchField +",
>>>>>> '"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");
>>>>>>
>>>>>>    }
>>>>>>    else {
>>>>>>
>>>>>>        //only use fn:lower-case if there is spaces, with NO wild cards
>>>>>>
>>>>>>        if ( searchTerm.contains (" ") && !searchTerm.contains ("*")
>>>>>>                && !searchTerm.contains ("?") ) {
>>>>>>
>>>>>>            filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>>>>>> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>>>
>>>>>>        }
>>>>>>
>>>>>>        else {
>>>>>>
>>>>>>            //jcr:contains for case insensitive
>>>>>>            filter.addContains ( srchField,
>>>>>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>>>
>>>>>>        }
>>>>>>
>>>>>>    }
>>>>> This seems to me a workaround for the real problem, because it
>>>>> just doesn't make sense to me. Can you inspect the tokens that are
>>>>> created by your analyser? Make sure you inspect the tokens during
>>>>> indexing (just store something) and during searching (just search in
>>>>> the property). I am quite sure you'll see the issue then. Perhaps it is
>>>>> something with Text.escapeIllegalXpathSearchChars, though it seems that
>>>>> it should leave spaces untouched.
>>>>>
>>>>> Regards Ard
>>>>>
>>>>>
>>>>>>    ....
>>>>>>
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Hope that helps anyone who needs it.
>>>>>>
>>>>>> H. Wilson
>>>>>>
>>>>>>>> OK so it looks like I have one other issue. Using the configuration as
>>>>>>>> posted below and sticking to my previous examples, with the addition of
>>>>>>>> one with whitespace. With the following three in our repository:
>>>>>>>>
>>>>>>>>    .North.South.East.WestLand
>>>>>>>>    .North.South.East.West_Land
>>>>>>>>    .North.South.East.West Land    //yes that's a space
>>>>>>>>
>>>>>>>> ...using a jcr:contains, with exact name search with NO wild cards: the
>>>>>>>> first two return properly, but the last one yields no result.
>>>>>>>>
>>>>>>>>    filter.addContains(@fullName,
>>>>>>>> '"+org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(".North.South.East.West Land") +"'));
>>>>>>> I think the space in a contains is seen as an AND by the
>>>>>>> Jackrabbit/Lucene QueryParser. I should test this, however, as I am not
>>>>>>> sure. Perhaps you can put quotes around it; I'm not sure if that works,
>>>>>>> though.
>>>>>>>
>>>>>>> Regards Ard
>>>>>>>
>>>>>>>> According to the Lucene documentation, KeywordAnalyzer should be creating
>>>>>>>> one token, and combined with escaping the illegal characters (i.e.
>>>>>>>> spaces), shouldn't this search work? Thanks again.
>>>>>>>>
>>>>>>>> H. Wilson
