On Mon, Aug 30, 2010 at 6:46 PM, H. Wilson <[email protected]> wrote:
> Ard,
>
> I don't have exact results for you yet because I am seeing unexpected
> behavior and I would like to make sure I am using the TokenStream/Token
> classes correctly. To verify the Tokens, I modified JUST my custom analyzer
> to look like below. I did NOT remove my "workaround" code, as I wanted to
> make sure that adding this code to the analyzer would allow it to compile,
> run and still behave as it did before. However, when I added _just_ this
> code, my results changed so strangely that it led me to believe I was using
> TokenStream incorrectly. Can you confirm? The change in results was this:
> previously all my test queries worked, whereas merely adding the code below
> to the analyzer led to the following:
>
> Given the following parameters in the repository:
>
> .North.South.East.WestLand
> .North.South.East.West_Land
> .North.South.East.West Land //yes that's a space
>
> The following exact name, case sensitive queries worked as expected for each
> of the three parameters:
>
> filter.orJCRExpression ("jcr:like(@" + srchField
> +",'"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')"); //case sens.
jcr:like does not depend on any analyser but on the stored field, so
it is not strange that it still works.
>
> The following exact name query, case insensitive, worked only for the
> parameter whose fullName contains a whitespace character:
>
> filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>
> The following exact name queries, case insensitive, stopped working for the
> fullnames WITHOUT a whitespace character:
>
> filter.addContains ( srchField,
> Text.escapeIllegalXpathSearchChars(searchTerm));
>
> Again, the only change I made was to the analyzer, I didn't remove my
> "workaround" yet, and I just want to confirm I properly changed the analyzer
> to figure out how the tokens were working. I should note that the output
> from the analyzer only showed one Token per field, which I believe is what
> we were looking for, and which leaves me as perplexed as before.
>
> LowerCaseKeywordAnalyzer.java:
>
> ...
>
> public TokenStream tokenStream ( String field, final Reader reader ) {
> System.out.println ("TOKEN STREAM for field: " + field);
> TokenStream keywordTokenStream = super.tokenStream (field,
> reader);
>
> //changed for testing
> TokenStream lowerCaseStream = new LowerCaseFilter (
> keywordTokenStream ) ;
> final Token reusableToken = new Token();
> try {
> Token mytoken = lowerCaseStream.next (reusableToken);
> while ( mytoken != null ) {
> System.out.println ("[" + mytoken.term() + "]");
> mytoken = lowerCaseStream.next (mytoken);
> }
> //lowerCaseStream.reset(); //uncommenting this did not change results.
> }
> catch (IOException ioe) {
> System.err.println ("ERROR: " + ioe.toString());
> }
>
It's a stream! Once your debug loop has consumed it, keywordTokenStream is
exhausted. Call reset() on keywordTokenStream before using it again.
Regards Ard
> return (new LowerCaseFilter ( keywordTokenStream ) );
> }
>
> ...
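Ard's point about the stream being consumed can be illustrated without Lucene. Below is a minimal stand-alone sketch (the SimpleTokenStream class is hypothetical, for illustration only; it is not Lucene's API): pulling all tokens once leaves the stream empty for any later consumer, and reset() rewinds it.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a token stream: tokens can be pulled once,
// after which the stream is exhausted until reset() is called.
// (Hypothetical class, not Lucene's API.)
class SimpleTokenStream {
    private final String[] tokens;
    private int pos = 0;

    SimpleTokenStream(String... tokens) { this.tokens = tokens; }

    /** Returns the next token, or null when the stream is exhausted. */
    String next() {
        return pos < tokens.length ? tokens[pos++] : null;
    }

    /** Rewinds the stream so it can be consumed again. */
    void reset() { pos = 0; }
}

public class StreamDemo {
    static List<String> drain(SimpleTokenStream s) {
        List<String> out = new ArrayList<>();
        for (String t = s.next(); t != null; t = s.next()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        SimpleTokenStream stream =
                new SimpleTokenStream(".north.south.east.west land");

        // The debug loop consumes everything...
        System.out.println("first pass:  " + drain(stream));
        // ...so a second consumer (the returned filter) sees nothing:
        System.out.println("second pass: " + drain(stream));
        // Calling reset() makes the tokens available again:
        stream.reset();
        System.out.println("after reset: " + drain(stream));
    }
}
```

This is why the debug printing alone changed the search results: the returned LowerCaseFilter wraps an already-drained stream.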
>
> Thanks.
>
> H. Wilson
>
> On 08/30/2010 09:38 AM, Ard Schrijvers wrote:
>>
>> On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson<[email protected]> wrote:
>>>
>>> Ard,
>>>
>>> You are absolutely right, and this didn't make sense to me either. I
>>> think
>>> I was too worn out from my week and too excited to have code that
>>> "worked"
>>> to notice the obvious... this must be a workaround. However, I will need
>>> a
>>> little guidance on how to inspect the tokens. I have Luke, but never
>>> really
>>> understood how to use it properly. Could you give me a clear list of
>>> steps,
>>> or point me to a resource I missed, on how I would go about inspecting
>>> tokens during insert/search? Thanks.
>>
>> I'd just print them to your console with Token#term() or use a
>> debugger. If you do that during indexing and searching, I think you
>> should see some difference in the tokens that explains *why* Lucene
>> doesn't find a hit for your usecase with spaces.
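The kind of difference Ard expects to surface can be sketched without Lucene. As an assumption for illustration (these helpers are hypothetical, not Lucene's API): a keyword-style analyzer emits the whole field value as a single lowercased token, while whitespace-based analysis splits the same text on spaces, so the two sides never agree for values containing a space.

```java
import java.util.Arrays;
import java.util.List;

public class TokenCompare {
    // Keyword-style analysis: the whole value becomes one lowercased token.
    static List<String> keywordTokens(String value) {
        return List.of(value.toLowerCase());
    }

    // Whitespace-style analysis, as a full-text query side might split a term.
    static List<String> whitespaceTokens(String value) {
        return Arrays.asList(value.toLowerCase().split("\\s+"));
    }

    public static void main(String[] args) {
        String value = ".North.South.East.West Land";
        // One token, space included:
        System.out.println("indexed as: " + keywordTokens(value));
        // Two tokens, neither of which equals the indexed one:
        System.out.println("queried as: " + whitespaceTokens(value));
    }
}
```

Printing both sides this way during indexing and during searching should make any mismatch obvious.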
>>
>> Luke is hard to use for the multi-index Jackrabbit indexing, as well
>> as the field value prefixing. The prefixing is unfortunate and no longer
>> strictly necessary, but it has historical reasons from Lucene back in
>> the days when it could not handle very many unique field names.
>>
>> Regards Ard
>>
>>> H. Wilson
>>>
>>> On 08/30/2010 03:30 AM, Ard Schrijvers wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson<[email protected]>
>>>> wrote:
>>>>>
>>>>> OK, well I got the spaces part figured out, and will post it for
>>>>> anyone
>>>>> who
>>>>> needs it. Putting quotes around the spaces unfortunately did not work.
>>>>> During testing, I determined that if you performed the following query
>>>>> for
>>>>> the exact fullName property:
>>>>>
>>>>> filter.addContains ( "@fullName",
>>>>> Text.escapeIllegalXpathSearchChars(".North.South.East.West Land"));
>>>>>
>>>>> It would return nothing. But tweak it a little and add a wildcard, and
>>>>> it
>>>>> would return results:
>>>>>
>>>>> filter.addContains ( "@fullName",
>>>>> Text.escapeIllegalXpathSearchChars(".North.South.East.West Lan*"));
>>>>
>>>> This does not make sense...see below
>>>>
>>>>> But since I did not want to throw in wild cards where they might not be
>>>>> wanted, if a search string contained spaces, did not contain wild cards
>>>>> and
>>>>> the user was not concerned with case sensitivity, I used the
>>>>> fn:lower-case.
>>>>> So I ended up with the following excerpt (our clients wanted options
>>>>> for
>>>>> case sensitive and case insensitive searching) .
>>>>>
>>>>> public OurParameter[] getOurParameters (boolean
>>>>> performCaseSensitiveSearch,
>>>>> String searchTerm, String srchField ) { //srchField in this case was
>>>>> fullName
>>>>>
>>>>> .....
>>>>>
>>>>> if ( performCaseSensitiveSearch) {
>>>>>
>>>>> //jcr:like for case sensitive
>>>>> filter.orJCRExpression ("jcr:like(@" + srchField +",
>>>>> '"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");
>>>>>
>>>>> }
>>>>> else {
>>>>>
>>>>> //only use fn:lower-case if there are spaces, with NO wild cards
>>>>>
>>>>> if ( searchTerm.contains (" ")&& !searchTerm.contains
>>>>> ("*")&&
>>>>> !searchTerm.contains ("?") ) {
>>>>>
>>>>> filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>>>>> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>>
>>>>> }
>>>>>
>>>>> else {
>>>>>
>>>>> //jcr:contains for case insensitive
>>>>> filter.addContains ( srchField,
>>>>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>>
>>>>> }
>>>>>
>>>>> }
>>>>
>>>> This seems to me a workaround for the real problem, because it
>>>> just doesn't make sense to me. Can you inspect the tokens that are
>>>> created by your analyser? Make sure you inspect the tokens during
>>>> indexing (just store something) and during searching (just search in
>>>> the property). I am quite sure you'll see the issue then. Perhaps it is
>>>> something with Text.escapeIllegalXpathSearchChars, though it seems that
>>>> it should leave spaces untouched.
>>>>
>>>> Regards Ard
>>>>
>>>>
>>>>> ....
>>>>>
>>>>> }
>>>>>
>>>>>
>>>>> Hope that helps anyone who needs it.
>>>>>
>>>>> H. Wilson
>>>>>
>>>>>>> OK so it looks like I have one other issue. Using the configuration
>>>>>>> as
>>>>>>> posted below and sticking to my previous examples, with the addition
>>>>>>> of
>>>>>>> one
>>>>>>> with whitespace. With the following three in our repository:
>>>>>>>
>>>>>>> .North.South.East.WestLand
>>>>>>> .North.South.East.West_Land
>>>>>>> .North.South.East.West Land //yes that's a space
>>>>>>>
>>>>>>> ...using a jcr:contains, with exact name search with NO wild cards:
>>>>>>> the
>>>>>>> first two return properly, but the last one yields no result.
>>>>>>>
>>>>>>> filter.addContains("@fullName",
>>>>>>> org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(
>>>>>>> ".North.South.East.West Land"));
>>>>>>
>>>>>> I think the space in a contains is seen as an AND by the
>>>>>> Jackrabbit/Lucene QueryParser. I should test this, however, as I am
>>>>>> not sure. Perhaps you can put quotes around it; I am not sure if that
>>>>>> works though.
>>>>>>
>>>>>> Regards Ard
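The behavior Ard suspects can be modeled with a toy parser (an assumption for illustration only, not Jackrabbit's actual QueryParser): unquoted whitespace splits a contains() term into multiple terms that are implicitly AND-ed, while surrounding double quotes keep the whole value as a single phrase.

```java
import java.util.Arrays;
import java.util.List;

public class NaiveContainsParser {
    // Splits an unquoted search term on whitespace into AND-ed terms;
    // a term wrapped in double quotes is kept as one phrase.
    // (A toy model of the suspected behavior, not Jackrabbit's real parser.)
    static List<String> parse(String term) {
        if (term.length() >= 2 && term.startsWith("\"") && term.endsWith("\"")) {
            return List.of(term.substring(1, term.length() - 1)); // one phrase
        }
        return Arrays.asList(term.split("\\s+"));                 // AND-ed terms
    }

    public static void main(String[] args) {
        // Two terms, implicitly AND-ed; neither equals the single indexed token:
        System.out.println(parse(".North.South.East.West Land"));
        // One phrase, kept intact:
        System.out.println(parse("\".North.South.East.West Land\""));
    }
}
```

If this model matches Jackrabbit's behavior, wrapping the space-containing value in quotes inside the contains() expression would be the thing to try.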
>>>>>>
>>>>>>> According to the Lucene documentation, KeywordAnalyzer should be
>>>>>>> creating
>>>>>>> one token, plus combined with escaping the Illegal Characters (i.e.
>>>>>>> spaces),
>>>>>>> shouldn't this search work? Thanks again.
>>>>>>>
>>>>>>> H. Wilson
>