Re: RegexQuery Incomplete Results

Ian Lea Tue, 12 May 2009 07:35:04 -0700

The Jakarta regexp package is a separate download.
http://jakarta.apache.org/regexp/index.html.


--
Ian.


On Tue, May 12, 2009 at 3:21 PM, Seid Mohammed <seidy...@gmail.com> wrote:
> I need it similar functionality, but while running the above code it
> breaks after outputing the following
> ========================================================================
> Added Knowing yourself
> Added Old clinic
> Added INSIDE
> Added Not INSIDE
>
> Default 
> regexcapabilities=org.apache.lucene.search.regex.javautilregexcapabilit...@0
>
> org.apache.lucene.search.regex.javautilregexcapabilit...@0
> 0 hits for text:.in
> 2 hits for text:.*in
> 0 hits for text:.IN
> 2 hits for text:.*IN
> org.apache.lucene.search.regex.jakartaregexpcapabilit...@0
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/regexp/RE
>        at 
> org.apache.lucene.search.regex.JakartaRegexpCapabilities.compile(JakartaRegexpCapabilities.java:32)
>        at 
> org.apache.lucene.search.regex.RegexTermEnum.<init>(RegexTermEnum.java:47)
>        at 
> org.apache.lucene.search.regex.RegexQuery.getEnum(RegexQuery.java:59)
>        at 
> org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:55)
>        at 
> org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:162)
>        at org.apache.lucene.search.Query.weight(Query.java:94)
>        at org.apache.lucene.search.Hits.<init>(Hits.java:76)
>        at org.apache.lucene.search.Searcher.search(Searcher.java:50)
>        at org.apache.lucene.search.Searcher.search(Searcher.java:40)
>        at Regex2.main(Regex2.java:43)
> Caused by: java.lang.ClassNotFoundException: org.apache.regexp.RE
>        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
>        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
>        ... 10 more
> ===================================================================
>
> thanks a lot
>
> On 5/11/09, Huntsman84 <tpgarci...@gmail.com> wrote:
>>
>> That's it!!!
>>
>> The problem was with the regular expression, the one I need is ".*IN"!!
>>
>> Thank you so much, I was turning mad... =)
>>
>>
>> Ian Lea wrote:
>>>
>>> The little self-contained program below runs regex queries for a few
>>> regexps against a few phrases for both the java.util and jakarta
>>> regexp packages.
>>>
>>> Output when run with lucene 2.4.1 and jakarta-regexp 1.5 is
>>>
>>> Added Knowing yourself
>>> Added Old clinic
>>> Added INSIDE
>>> Added Not INSIDE
>>>
>>> Default
>>> regexcapabilities=org.apache.lucene.search.regex.javautilregexcapabilit...@0
>>>
>>> org.apache.lucene.search.regex.javautilregexcapabilit...@0
>>> 0 hits for text:.in
>>> 2 hits for text:.*in
>>> 0 hits for text:.IN
>>> 2 hits for text:.*IN
>>> org.apache.lucene.search.regex.jakartaregexpcapabilit...@0
>>> 2 hits for text:.in
>>> 2 hits for text:.*in
>>> 1 hits for text:.IN
>>> 2 hits for text:.*IN
>>>
>>> Hope that helps.
>>>
>>> --
>>> Ian.
>>>
>>>
>>> import org.apache.lucene.index.*;
>>> import org.apache.lucene.store.*;
>>> import org.apache.lucene.document.*;
>>> import org.apache.lucene.analysis.*;
>>> import org.apache.lucene.analysis.standard.*;
>>> import org.apache.lucene.search.*;
>>> import org.apache.lucene.search.regex.*;
>>>
>>> public class luctest {
>>>
>>>     public static void main(String[] _args) throws Exception {
>>>      RAMDirectory rdir = new RAMDirectory();
>>>      IndexWriter writer = new IndexWriter(rdir, new StandardAnalyzer(), 
>>> true);
>>>      String[] docterms = { "Knowing yourself",
>>>                            "Old clinic",
>>>                            "INSIDE",
>>>                            "Not INSIDE" };
>>>
>>>      for (String s : docterms) {
>>>          Document d = new Document();
>>>          d.add(new Field("text",
>>>                          s,
>>>                          Field.Store.YES,
>>>                          Field.Index.NOT_ANALYZED));
>>>          writer.addDocument(d);
>>>          System.out.printf("Added %s\n", s);
>>>      }
>>>      writer.close();
>>>
>>>      IndexSearcher searcher = new IndexSearcher(rdir);
>>>      String[] queries = { ".in", ".*in", ".IN", ".*IN" };
>>>      RegexCapabilities[] rcaps = { new JavaUtilRegexCapabilities(),
>>>                                    new JakartaRegexpCapabilities() };
>>>      RegexQuery qx = new RegexQuery(new Term("x", "x"));
>>>      System.out.printf("\nDefault RegexCapabilities=%s\n\n",
>>>                        qx.getRegexImplementation());
>>>      for (RegexCapabilities rcap : rcaps) {
>>>          System.out.println(rcap);
>>>          for (String s : queries) {
>>>              Term t = new Term("text", s);
>>>              RegexQuery q = new RegexQuery(t);
>>>              q.setRegexImplementation(rcap);
>>>              Hits h = searcher.search(q);
>>>              System.out.printf("%s hits for %s\n",
>>>                                h.length(),
>>>                                q.toString());
>>>          }
>>>      }
>>>     }
>>> }
>>>
>>>
>>> On Mon, May 11, 2009 at 1:39 PM, Huntsman84 <tpgarci...@gmail.com> wrote:
>>>>
>>>> The RegexQuery class uses that package, and for that reason the
>>>> expression
>>>> matches.
>>>>
>>>> If my records contained only one word each, this code would work, but I
>>>> need
>>>> to apply that regular expression to a phrase...
>>>>
>>>>
>>>> Ian Lea wrote:
>>>>>
>>>>> The default regex package is java.util.regex and I can't see anywhere
>>>>> that you tell it to use the Jakarta regexp package.  So I don't think
>>>>> that ".in" will match.  Also, you are storing your contents field as
>>>>> NOT_ANALYZED so you will need to be wary of case sensitivity.  Maybe
>>>>> this is what you want, but maybe not.
>>>>>
>>>>>
>>>>> --
>>>>> Ian.
>>>>>
>>>>>
>>>>> On Mon, May 11, 2009 at 9:00 AM, Huntsman84 <tpgarci...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> This is the code for searching:
>>>>>>
>>>>>> String index = "index";
>>>>>> String field = "contents";
>>>>>> IndexReader reader = IndexReader.open(index);
>>>>>> Searcher searcher = new IndexSearcher(reader);
>>>>>>
>>>>>> System.out.println("Enter query: ");
>>>>>> String line = ".IN.";//in jakarta regexp this is like * IN *
>>>>>> RegexQuery rxquery = new RegexQuery(new Term(field,line));
>>>>>> Hits hits = searcher.search(rxquery);
>>>>>>
>>>>>> if(hits!=null){
>>>>>>    for(int k = 0; k<100 && k<hits.length(); k++){
>>>>>>        if(hits.doc(k)!=null)
>>>>>>
>>>>>>  System.out.println(hits.doc(k).getField("contents").stringValue());
>>>>>>    }
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>> And this is the part of creating the index:
>>>>>>
>>>>>>
>>>>>> File directory = new File("index");
>>>>>> IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(),
>>>>>> true,
>>>>>>                            IndexWriter.MaxFieldLength.LIMITED);
>>>>>> List<String> records = getRecords();//returns a list of record values
>>>>>> from
>>>>>> database, all of them are phrases
>>>>>> Iterator<String> i = records.iterator();
>>>>>> while(i.hasNext()){
>>>>>>           Document doc = new Document();
>>>>>>           doc.add(new Field(field, i.next(), Field.Store.YES,
>>>>>> Field.Index.NOT_ANALYZED));
>>>>>>        writer.addDocument(doc);
>>>>>> }
>>>>>> writer.optimize();
>>>>>> writer.close();
>>>>>>
>>>>>>
>>>>>>
>>>>>> This code works as I want but just matching with the first word of the
>>>>>> phrase. I think the problem is the index building, but I don't know how
>>>>>> to
>>>>>> fix it...
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Thank you so much!!
>>>>>>
>>>>>>
>>>>>>
>>>>>> Steven A Rowe wrote:
>>>>>>>
>>>>>>> On 5/8/2009 at 9:13 AM, Ian Lee wrote:
>>>>>>>> I'm surprised that it matches either - don't you need ".*in" where .*
>>>>>>>> means match any character zero or more times?  See the javadoc for
>>>>>>>> java.util.regex.Pattern, or for Jakarta Regexp if you are using that
>>>>>>>> package.
>>>>>>>>
>>>>>>>> Unless you're an expert in regexps it is probably worth playing with
>>>>>>>> them outside your lucene code to start with e.g. with simple
>>>>>>>> String.matches(regexp) calls.  They can take some getting used to.
>>>>>>>> And try to avoid anything with backslashes if you can!
>>>>>>>
>>>>>>> The java.util.regex.Pattern implementation (the default RegexQuery
>>>>>>> implementation) actually uses Matcher.lookingAt(), which is equivalent
>>>>>>> to
>>>>>>> prepending a "^" anchor to the beginning of the pattern, so if
>>>>>>> Huntsman84
>>>>>>> is using the default implementation, then I agree with Ian: I'm
>>>>>>> surprised
>>>>>>> it matches either.
>>>>>>>
>>>>>>> However, the Jakarta Regexp implementation uses RE.match(), which does
>>>>>>> *not* require a beginning-of-string match.
>>>>>>>
>>>>>>> Hunstman84, are you using the Jakarta Regexp implementation?  If so,
>>>>>>> then
>>>>>>> like you, I'm surprised it's not matching both :).
>>>>>>>
>>>>>>> It would be useful to see some real code, including how you index your
>>>>>>> records.
>>>>>>>
>>>>>>> Steve
>>>>>>>
>>>>>>>> On Fri, May 8, 2009 at 1:42 PM, Huntsman84 <tpgarci...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > I am using RegexQuery for searching in a set of records wich are
>>>>>>>> > phrases of several words each. My aim is to find any phrase that
>>>>>>>> > contains the given group of letters (e.g. "in"). For that case,
>>>>>>>> > I am building the query with the regular expression ".in.", so it
>>>>>>>> > should return all phrases with contain "in", but the search only
>>>>>>>> > matches with the first word of the phrase.
>>>>>>>> >
>>>>>>>> > For example, if my records are "Knowing yourself" and "Old
>>>>>>>> > clinic", the correct search would return 2 matches, but it only
>>>>>>>> > matches with "Knowing yourself".
>>>>>>>> >
>>>>>>>> > How could I fix this?
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://www.nabble.com/RegexQuery-Incomplete-Results-tp23445235p23478720.html
>>>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/RegexQuery-Incomplete-Results-tp23445235p23482532.html
>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/RegexQuery-Incomplete-Results-tp23445235p23486350.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
> --
> "RABI ZIDNI ILMA"
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: RegexQuery Incomplete Results

Reply via email to