Re: Problems with hyphen in JSR-170 XPath query using jcr:contains

H. Wilson Thu, 26 Aug 2010 09:23:26 -0700

Finally! I have been hacking away at this here and there for months, trying all different analyzers or not using analyzers and modifying my queries, all to no avail! Since I always like precise examples when I am searching forums, I will post my (nearly) exact solution both for others and so that Ard might verify that this was indeed what he meant.

Ard, I was hoping you could elaborate a little on why we would duplicate the property? (I got this working perfectly without actually doing that.) You lost me a little there; was it for efficiency? Thanks for everything!


H. Wilson

repository.xml (modified both SearchIndex tags to include an indexingConfiguration):


   <SearchIndex
   class="org.apache.jackrabbit.core.query.lucene.SearchIndex">

       ....
       <param name="indexingConfiguration"
       value="${rep.home}/indexing_configuration.xml"/>

   </SearchIndex>


indexing_configuration.xml:

   <configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
       <analyzers>
           <analyzer class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
               <property>fullName</property>
           </analyzer>
       </analyzers>
   </configuration>


LowerCaseKeywordAnalyzer.java:

   package org.mycompany.lucene.analysis;

   import java.io.Reader;
   import org.apache.lucene.analysis.KeywordAnalyzer;
   import org.apache.lucene.analysis.LowerCaseFilter;
   import org.apache.lucene.analysis.TokenStream;

   // Indexes the whole property value as a single, lowercased token.
   public class LowerCaseKeywordAnalyzer extends KeywordAnalyzer {

       public TokenStream tokenStream (String field, final Reader reader) {
           // KeywordAnalyzer emits the entire value as one token...
           TokenStream keywordTokenStream = super.tokenStream (field, reader);
           // ...which is then lowercased so matching is case-insensitive.
           return new LowerCaseFilter (keywordTokenStream);
       }
   }


Our search class has a method which then does the following:

   public OurParameter[] getOurParameters (String searchTerm, String srchField) {
       // srchField in this case was fullName

       TransientRepository repository =
           new TransientRepository (OUR_REPO_CONFIG, OUR_REPO_LOCATION);
       Session session = repository.login ();
       List<Class> classes = new ArrayList<Class>();
       classes.add (OurParameter.class);
       Mapper mapper = new AnnotationMapperImpl (classes);
       ObjectContentManager ocm = new ObjectContentManagerImpl (session, mapper);
       queryManager = ocm.getQueryManager();
       FilterImpl filter = (FilterImpl) queryManager.createFilter (OurParameter.class);
       filter.addContains (srchField,
           org.apache.jackrabbit.util.Text.escapeIllegalXpathSearchChars (searchTerm)
               .replaceAll ("'", "''"));
       // (that last was replace all single ticks with two ticks, I
       // honestly can't remember why though)
       Query query = queryManager.createQuery (filter);
       Collection<OurParameter> resultsCollection =
           (Collection<OurParameter>) ocm.getObjects (query);

       //convert to an array, do some other stuff, and return...

   }
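
A note on the doubled single ticks in the method above: in an XPath string literal delimited by apostrophes, an embedded apostrophe is written as two adjacent apostrophes, which is most likely what the replaceAll call is for. The sketch below only illustrates that quoting rule in plain Java. The class and method names (XPathLiteralDemo, quote, containsQuery) are hypothetical, the query shape is an assumed example, and the Jackrabbit escaping step (escapeIllegalXpathSearchChars) is deliberately left out.

```java
public class XPathLiteralDemo {

    // Wrap a search term in apostrophes, doubling any apostrophe inside it.
    public static String quote(String term) {
        return "'" + term.replace("'", "''") + "'";
    }

    // Build an illustrative jcr:contains predicate for one property.
    // Assumption: the property name is trusted and needs no escaping.
    public static String containsQuery(String property, String term) {
        return "//*[jcr:contains(@" + property + ", " + quote(term) + ")]";
    }

    public static void main(String[] args) {
        // prints //*[jcr:contains(@fullName, 'O''Brien')]
        System.out.println(containsQuery("fullName", "O'Brien"));
    }
}
```

Without the doubling, the apostrophe in a term such as O'Brien would end the string literal early and the query would fail to parse.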



On 08/26/2010 10:42 AM, Ard Schrijvers wrote:

On Thu, Aug 26, 2010 at 3:53 PM, H. Wilson wrote:

  Ard,

I have this same problem, however my scenario involves underscores rather
than hyphens. Although since Chris seems to be seeing the same exact

It is because hyphens, just like underscores, are characters that the
standard Lucene analyzer splits tokens on. This, combined with the query
expansion that happens for wildcard searches in Lucene, causes your issues.

behavior as I was, I imagine we are both stuck on the same issue. After
scouring the forums for the solution, and not seeing your mentioned
solution, I actually posted my problem as detailed as possible here (
http://markmail.org/message/yh72wqd5b2hbr3j6 ) and received no response.
jcr:like was not an option for me, in this case, as our client wanted the
option for case-insensitive searches. Is there any chance you could please
narrow down where-about the post was which already covered this? Thanks for

I can't seem to find my post again. But, I'll give you a quite simple solution:

If you want to keep the normal indexing of the property for normal
searching, but also want to have the yyy* option, you need to
duplicate the property in another property. If your property, with
values like

.North.South.East.WestLand

is only needed for the wildcard searching you describe, you
only need it once. Now, suppose your property is called myProp.

To your configuration.xml add:

<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
   <analyzers>
         <analyzer
class="org.mycompany.lucene.analysis.LowerCaseKeywordAnalyzer">
             <property>myProp</property>
         </analyzer>
   </analyzers>
</configuration>

Your LowerCaseKeywordAnalyzer is very simple: it extends
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/KeywordAnalyzer.html
and in the method

  TokenStream tokenStream(String fieldName,Reader reader)

after calling super, you wrap the result in Lucene's LowerCaseFilter.

That is all (after you re-index your repository). Since a -, _, ~ or
whatever is now no longer seen as a character to split on, but you
still use the lowercase filter, you can do exactly what you want.

Do the words need to be split on spaces, however? No problem, just add
a WhitespaceTokenizer from Lucene. It is actually pretty simple.
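
The effect of the three tokenization strategies discussed here can be modeled in plain Java. This is a simplified sketch and does not use Lucene: splitOnPunctuation is only a rough stand-in for a splitting analyzer, singleToken for the keyword-plus-lowercase analyzer, and splitOnWhitespace for the whitespace variant. It also assumes that a prefix term is compared against the indexed tokens as-is. All names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class TokenizationModel {

    // Rough stand-in for a splitting analyzer: lowercase the value, then
    // break it on every character that is not a letter or a digit.
    public static List<String> splitOnPunctuation(String value) {
        return nonEmpty(value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"));
    }

    // Keyword style: the whole value becomes one lowercased token.
    public static List<String> singleToken(String value) {
        return Collections.singletonList(value.toLowerCase(Locale.ROOT));
    }

    // Whitespace style: lowercase, split on whitespace only, so
    // hyphens and underscores stay inside the tokens.
    public static List<String> splitOnWhitespace(String value) {
        return nonEmpty(value.toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Models a prefix search such as sophie-a* against the indexed tokens.
    public static boolean anyTokenStartsWith(List<String> tokens, String prefix) {
        String lowered = prefix.toLowerCase(Locale.ROOT);
        for (String token : tokens) {
            if (token.startsWith(lowered)) {
                return true;
            }
        }
        return false;
    }

    // Drops the empty strings that String.split can produce.
    private static List<String> nonEmpty(String[] parts) {
        List<String> tokens = new ArrayList<String>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String name = "Sophie-Allen";
        System.out.println(splitOnPunctuation(name));                                 // [sophie, allen]
        System.out.println(anyTokenStartsWith(splitOnPunctuation(name), "sophie-a")); // false
        System.out.println(anyTokenStartsWith(singleToken(name), "sophie-a"));        // true
        System.out.println(splitOnWhitespace("Sophie-Allen Smith"));                  // [sophie-allen, smith]
    }
}
```

In this model the prefix sophie-a finds nothing once the hyphen has been used as a split point, because no single token starts with it, while the single-token and whitespace variants keep sophie-allen intact.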

Hope this helps,

Regards Ard

your time.

*H. Wilson*


On 08/26/2010 04:59 AM, Ard Schrijvers wrote:

Hello,

You can search the archives (mails from me) for the wildcard searching
issues described below. Someone else had similar issues, and I
explained the wildcard difficulties there. Take a look at jcr:like for
your use cases.

Regards Ard

On Thu, Aug 26, 2010 at 10:19 AM, Dunstall, Christopher wrote:

Hi all,

I'm having some trouble with an XPath query, where I'm searching for
users with hyphens in their name.

I'm using:
jcr:contains(*/*/*,'query')

And it returns some odd results.

I have two users, Sophie-Allen and Sophie-Anne. When I search for
'sophie', I get both users back. OK, fine, but if I search for 'sophie-a'
(with the hyphen escaped as 'sophie\-a' as per the JSR-170 Spec) I get zero
results returned.  Oddly, if I search for either 'sophie-allen' or
'sophie-anne' I get the respective user details back fine. Shouldn't I get
both users back when escaping the hyphen? Have I missed something in the
spec?

One other odd thing is the addition of an asterisk (*).  Searching for
'soph' and 'soph*' return the same result (both users), but if I search for
'sophie-allen*', I get zero results, unlike when searching for just
'sophie-allen'. Searching for 'sophie-a*' has the same result as without the
asterisk, i.e. nothing.

The JSR-170 spec doesn't say anything (that I can find) but is the
asterisk a wildcard in the jcr:contains function or does it serve some other
purpose?

Your assistance is greatly appreciated,

Regards,

Chris Dunstall | Service Support - Applications
Technology Integration/OLE Virtual Team
Division of Information Technology | Charles Sturt University | Bathurst,
NSW, Australia

Ph: 02 63384818 | Fax: 02 63384181
