On Mon, Aug 30, 2010 at 6:46 PM, H. Wilson <[email protected]> wrote:
> Ard,
>
> I don't have exact results for you yet because I am seeing unexpected
> behavior and I would like to make sure I am using the TokenStream/Token
> classes correctly. To verify the Tokens, I modified JUST my custom analyzer
> to look like below. I did NOT remove my "workaround" code, as I wanted to
> make sure that adding this code to the analyzer would allow it to compile,
> run and still behave as it did before. However, when I added _just_ this
> code, my results changed so strangely that it led me to believe I was using
> TokenStream incorrectly. Can you confirm? The change in results was this:
> previously all my test queries worked, whereas merely adding the code below
> to the analyzer led to the following:
>
> Given the following parameters in the repository:
>
> .North.South.East.WestLand
> .North.South.East.West_Land
> .North.South.East.West Land //yes that's a space
>
> The following exact name, case sensitive queries worked as expected for each
> of the three parameters:
>
> filter.orJCRExpression ("jcr:like(@" + srchField
> +",'"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')"); //case sens.
jcr:like does not depend on any analyser but on the stored field, so
it is not strange that it still works.
>
> The following exact name query, case insensitive, worked only for the
> parameter whose fullName contains a whitespace character:
>
> filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>
> The following exact name queries, case insensitive, stopped working for the
> fullnames WITHOUT a whitespace character:
>
> filter.addContains ( srchField,
> Text.escapeIllegalXpathSearchChars(searchTerm));
>
> Again, the only change I made was to the analyzer, I didn't remove my
> "workaround" yet, and I just want to confirm I properly changed the analyzer
> to figure out how the tokens were working. I should note that the output
> from the analyzer only showed one Token per field, which I believe is what
> we were looking for, and which leaves me as perplexed as before.
>
> LowerCaseKeywordAnalyzer.java:
>
> ...
>
> public TokenStream tokenStream ( String field, final Reader reader ) {
> System.out.println ("TOKEN STREAM for field: " + field);
> TokenStream keywordTokenStream = super.tokenStream (field,
> reader);
>
> //changed for testing
> TokenStream lowerCaseStream = new LowerCaseFilter (
> keywordTokenStream ) ;
> final Token reusableToken = new Token();
> try {
> Token mytoken = lowerCaseStream.next (reusableToken);
> while ( mytoken != null ) {
> System.out.println ("[" + mytoken.term() + "]");
> mytoken = lowerCaseStream.next (mytoken);
> }
> //lowerCaseStream.reset(); //uncommenting this did not change results.
> }
> catch (IOException ioe) {
> System.err.println ("ERROR: " + ioe.toString());
> }
>
It's a stream! Once your debug loop has consumed it, keywordTokenStream is
exhausted. Call reset() on keywordTokenStream before using it again.
Regards Ard
> return (new LowerCaseFilter ( keywordTokenStream ) );
> }
>
> ...
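Ard's point about the stream being consumed can be illustrated without Lucene. Below is a minimal stand-alone sketch (the SimpleTokenStream class is hypothetical, for illustration only; it is not Lucene's API): pulling all tokens once leaves the stream empty for any later consumer, and reset() rewinds it.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a token stream: tokens can be pulled once,
// after which the stream is exhausted until reset() is called.
// (Hypothetical class, not Lucene's API.)
class SimpleTokenStream {
    private final String[] tokens;
    private int pos = 0;

    SimpleTokenStream(String... tokens) { this.tokens = tokens; }

    /** Returns the next token, or null when the stream is exhausted. */
    String next() {
        return pos < tokens.length ? tokens[pos++] : null;
    }

    /** Rewinds the stream so it can be consumed again. */
    void reset() { pos = 0; }
}

public class StreamDemo {
    static List<String> drain(SimpleTokenStream s) {
        List<String> out = new ArrayList<>();
        for (String t = s.next(); t != null; t = s.next()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        SimpleTokenStream stream =
                new SimpleTokenStream(".north.south.east.west land");

        // The debug loop consumes everything...
        System.out.println("first pass:  " + drain(stream));
        // ...so a second consumer (the returned filter) sees nothing:
        System.out.println("second pass: " + drain(stream));
        // Calling reset() makes the tokens available again:
        stream.reset();
        System.out.println("after reset: " + drain(stream));
    }
}
```

This is why the debug printing alone changed the search results: the returned LowerCaseFilter wraps an already-drained stream.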
>
> Thanks.
>
> H. Wilson
>
> On 08/30/2010 09:38 AM, Ard Schrijvers wrote:
>>
>> On Mon, Aug 30, 2010 at 3:30 PM, H. Wilson<[email protected]> wrote:
>>>
>>> Ard,
>>>
>>> You are absolutely right, and this didn't make sense to me either. I
>>> think
>>> I was too worn out from my week and too excited to have code that
>>> "worked"
>>> to notice the obvious... this must be a workaround. However, I will need
>>> a
>>> little guidance on how to inspect the tokens. I have Luke, but never
>>> really
>>> understood how to use it properly. Could you give me a clear list of
>>> steps,
>>> or point me to a resource I missed, on how I would go about inspecting
>>> tokens during insert/search? Thanks.
>>
>> I'd just print them to your console with Token#term() or use a
>> debugger. If you do that during indexing and searching, I think you
>> should see some difference in the tokens that explains *why* Lucene
>> doesn't find a hit for your usecase with spaces.
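The kind of difference Ard expects to surface can be sketched without Lucene. As an assumption for illustration (these helpers are hypothetical, not Lucene's API): a keyword-style analyzer emits the whole field value as a single lowercased token, while whitespace-based analysis splits the same text on spaces, so the two sides never agree for values containing a space.

```java
import java.util.Arrays;
import java.util.List;

public class TokenCompare {
    // Keyword-style analysis: the whole value becomes one lowercased token.
    static List<String> keywordTokens(String value) {
        return List.of(value.toLowerCase());
    }

    // Whitespace-style analysis, as a full-text query side might split a term.
    static List<String> whitespaceTokens(String value) {
        return Arrays.asList(value.toLowerCase().split("\\s+"));
    }

    public static void main(String[] args) {
        String value = ".North.South.East.West Land";
        // One token, space included:
        System.out.println("indexed as: " + keywordTokens(value));
        // Two tokens, neither of which equals the indexed one:
        System.out.println("queried as: " + whitespaceTokens(value));
    }
}
```

Printing both sides this way during indexing and during searching should make any mismatch obvious.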
>>
>> Luke is hard to use for the multi-index Jackrabbit indexing, as well
>> as the field value prefixing. The prefixing is unfortunate and no longer
>> strictly necessary, but it has historical reasons from Lucene back in
>> the days when it could not handle very many unique field names.
>>
>> Regards Ard
>>
>>> H. Wilson
>>>
>>> On 08/30/2010 03:30 AM, Ard Schrijvers wrote:
>>>>
>>>> Hello,
>>>>
>>>> On Fri, Aug 27, 2010 at 9:06 PM, H. Wilson<[email protected]>
>>>> wrote:
>>>>>
>>>>> OK, well I got the spaces part figured out, and will post it for
>>>>> anyone
>>>>> who
>>>>> needs it. Putting quotes around the spaces unfortunately did not work.
>>>>> During testing, I determined that if you performed the following query
>>>>> for
>>>>> the exact fullName property:
>>>>>
>>>>> filter.addContains ( "@fullName",
>>>>> Text.escapeIllegalXpathSearchChars(".North.South.East.West Land"));
>>>>>
>>>>> It would return nothing. But tweak it a little and add a wildcard, and
>>>>> it
>>>>> would return results:
>>>>>
>>>>> filter.addContains ( "@fullName",
>>>>> Text.escapeIllegalXpathSearchChars(".North.South.East.West Lan*"));
>>>>
>>>> This does not make sense...see below
>>>>
>>>>> But since I did not want to throw in wild cards where they might not be
>>>>> wanted, if a search string contained spaces, did not contain wild cards
>>>>> and
>>>>> the user was not concerned with case sensitivity, I used the
>>>>> fn:lower-case.
>>>>> So I ended up with the following excerpt (our clients wanted options
>>>>> for
>>>>> case sensitive and case insensitive searching) .
>>>>>
>>>>> public OurParameter[] getOurParameters (boolean
>>>>> performCaseSensitiveSearch,
>>>>> String searchTerm, String srchField ) { //srchField in this case was
>>>>> fullName
>>>>>
>>>>> .....
>>>>>
>>>>> if ( performCaseSensitiveSearch) {
>>>>>
>>>>> //jcr:like for case sensitive
>>>>> filter.orJCRExpression ("jcr:like(@" + srchField +",
>>>>> '"+Text.escapeIllegalXpathSearchChars (searchTerm)+"')");
>>>>>
>>>>> }
>>>>> else {
>>>>>
>>>>> //only use fn:lower-case if there are spaces, with NO wild cards
>>>>>
>>>>> if ( searchTerm.contains (" ")&& !searchTerm.contains
>>>>> ("*")&&
>>>>> !searchTerm.contains ("?") ) {
>>>>>
>>>>> filter.addJCRExpression ("fn:lower-case(@"+srchField+") =
>>>>> '"+Text.escapeIllegalXpathSearchChars(searchTerm.toLowerCase())+"'");
>>>>>
>>>>> }
>>>>>
>>>>> else {
>>>>>
>>>>> //jcr:contains for case insensitive
>>>>> filter.addContains ( srchField,
>>>>> Text.escapeIllegalXpathSearchChars(searchTerm));
>>>>>
>>>>> }
>>>>>
>>>>> }
>>>>
>>>> This seems to me a workaround for the real problem, because it
>>>> just doesn't make sense to me. Can you inspect the tokens that are
>>>> created by your analyser? Make sure you inspect the tokens during
>>>> indexing (just store something) and during searching (just search in
>>>> the property). I am quite sure you'll see the issue then. Perhaps it is
>>>> something with Text.escapeIllegalXpathSearchChars, though it seems that
>>>> it should leave spaces untouched.
>>>>
>>>> Regards Ard
>>>>
>>>>
>>>>> ....
>>>>>
>>>>> }
>>>>>
>>>>>
>>>>> Hope that helps anyone who needs it.
>>>>>
>>>>> H. Wilson
>>>>>
>>>>>>> OK so it looks like I have one other issue. Using the configuration
>>>>>>> as
>>>>>>> posted below and sticking to my previous examples, with the addition
>>>>>>> of
>>>>>>> one
>>>>>>> with whitespace. With the following three in our repository:
>>>>>>>
>>>>>>> .North.South.East.WestLand
>>>>>>> .North.South.East.West_Land
>>>>>>> .North.South.East.West Land //yes that's a space
>>>>>>>
>>>>>>> ...using a jcr:contains, with exact name search with NO wild cards:
>>>>>>> the
>>>>>>> first two return properly, but the last one yields no result.
>>>>>>>
>>>>>>> filter.addContains("@fullName",
>>>>>>> org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars(
>>>>>>> ".North.South.East.West Land"));
>>>>>>
>>>>>> I think the space in a contains is seen as an AND by the
>>>>>> Jackrabbit/Lucene QueryParser. I should test this, however, as I am
>>>>>> not sure. Perhaps you can put quotes around it; I am not sure if that
>>>>>> works though.
>>>>>>
>>>>>> Regards Ard
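The behavior Ard suspects can be modeled with a toy parser (an assumption for illustration only, not Jackrabbit's actual QueryParser): unquoted whitespace splits a contains() term into multiple terms that are implicitly AND-ed, while surrounding double quotes keep the whole value as a single phrase.

```java
import java.util.Arrays;
import java.util.List;

public class NaiveContainsParser {
    // Splits an unquoted search term on whitespace into AND-ed terms;
    // a term wrapped in double quotes is kept as one phrase.
    // (A toy model of the suspected behavior, not Jackrabbit's real parser.)
    static List<String> parse(String term) {
        if (term.length() >= 2 && term.startsWith("\"") && term.endsWith("\"")) {
            return List.of(term.substring(1, term.length() - 1)); // one phrase
        }
        return Arrays.asList(term.split("\\s+"));                 // AND-ed terms
    }

    public static void main(String[] args) {
        // Two terms, implicitly AND-ed; neither equals the single indexed token:
        System.out.println(parse(".North.South.East.West Land"));
        // One phrase, kept intact:
        System.out.println(parse("\".North.South.East.West Land\""));
    }
}
```

If this model matches Jackrabbit's behavior, wrapping the space-containing value in quotes inside the contains() expression would be the thing to try.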
>>>>>>
>>>>>>> According to the Lucene documentation, KeywordAnalyzer should be
>>>>>>> creating
>>>>>>> one token, plus combined with escaping the Illegal Characters (i.e.
>>>>>>> spaces),
>>>>>>> shouldn't this search work? Thanks again.
>>>>>>>
>>>>>>> H. Wilson
>