Re: Fine Tuning Lucene implementation

Askar Zaidi Wed, 25 Jul 2007 10:26:47 -0700

Hey guys,

One last question and I think I'll have an optimized algorithm.


How can I build a query in my program?

This is what I am doing:

QueryParser queryParser = new QueryParser("contents", new StandardAnalyzer());
queryParser.setDefaultOperator(QueryParser.Operator.AND);
Query q = queryParser.parse(queryString);

Printing it with System.out.println(q) shows:

+contents:harvard +contents:business +contents:review

I'd like to modify Query q to read:

+contents:harvard +contents:business +contents:review +itemID:<id passed to the search method>

That would match only the one document I need from the index and give me its
score, so I wouldn't have to iterate over Hits.

Any clues? I can't find any examples of query building.

Thanks!

Askar
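
One way to get the extra clause, sketched here with plain string handling: wrap the user's text in a required group and append a required itemID clause before handing the string to queryParser.parse(...). The class and method names below are hypothetical, and the sketch assumes itemID was indexed as a single token that the analyzer leaves unchanged (for example an untokenized numeric ID); it has not been run against the index from this thread.

```java
import java.util.Objects;

// Hypothetical helper: builds the query *string* only, so it needs no Lucene classes.
final class ItemQueryStrings {

    private ItemQueryStrings() {}

    /**
     * Wraps the user's query in a required group and appends a required
     * clause on the ID field, in standard query parser syntax.
     */
    static String restrictToItem(String userQuery, String idField, int id) {
        Objects.requireNonNull(userQuery, "userQuery");
        Objects.requireNonNull(idField, "idField");
        String trimmed = userQuery.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("userQuery must not be empty");
        }
        if (id < 0) {
            // Assumption: item IDs are non-negative; a leading '-' would read as a NOT operator.
            throw new IllegalArgumentException("id must not be negative");
        }
        return "+(" + trimmed + ") +" + idField + ":" + id;
    }

    public static void main(String[] args) {
        // Prints: +(harvard business review) +itemID:16
        System.out.println(restrictToItem("harvard business review", "itemID", 16));
    }
}
```

The returned string would then go through the same parser as above, e.g. queryParser.parse(restrictToItem(queryString, "itemID", id)). An alternative that avoids re-parsing the ID is to combine the parsed Query with a TermQuery on itemID inside a BooleanQuery, adding both with BooleanClause.Occur.MUST; check the javadoc for your Lucene version before relying on either route.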


On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> Yes, you can do that.
>
>
> On Jul 25, 2007, at 12:31 PM, Askar Zaidi wrote:
>
> > Here's what I mean:
> >
> > http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields
> >
> > title:"The Right Way" AND text:go
> >
> >
> > Although I am not searching for the title "The Right Way", I am looking
> > for the score by specifying a unique field (itemID).
> >
> > when I do System.out.println(query);
> >
> > I get:
> >
> > +contents:Harvard +contents:Business +contents:Review
> >
> > Can I just add:
> >
> > +contents:Harvard +contents:Business +contents:Review
> > +itemID=id ??
> >
> > That query would just return one document.
> >
> > On 7/25/07, Askar Zaidi <[EMAIL PROTECTED]> wrote:
> >>
> >> Instead of refactoring the code, would there be a way to just modify the
> >> query in each search routine?
> >>
> >> Such as "search contents:<text> and item:<itemID>". This means it would
> >> just collect the score of the one document whose itemID field equals the
> >> itemID passed in from while (rs.next()).
> >>
> >> I just need to collect the score of the <itemID> already in the index.
> >>
> >> Would there be a way to modify the query? Add a clause?
> >>
> >> thanks,
> >> Askar
> >>
> >>
> >> On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >>>
> >>> So, you really want a single Lucene score (based on the scores of your
> >>> 4 fields) for every itemID, correct?  And this score consists of scoring
> >>> the title, tag, summary and body against some keywords, correct?
> >>>
> >>> Here's what I would do:
> >>>
> >>> while (rs.next())
> >>> {
> >>>      // Get your document, including contents, from your database;
> >>>      // no need even to put them in Lucene, although you could.
> >>>      doc = getDocument(itemId);
> >>>      // Add the doc to a MemoryIndex (see contrib/memory).
> >>>      // Run your 4 searches against that memory index to get your score.
> >>>      // Even better, combine your query into a single query that searches
> >>>      // all 4 fields at once; then Lucene will combine the score for you.
> >>> }
> >>>
> >>> MemoryIndex info can be found at
> >>> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/memory/package-summary.html
> >>>
> >>> -Grant
> >>>
> >>> On Jul 25, 2007, at 11:45 AM, Askar Zaidi wrote:
> >>>
> >>>> Hi Grant,
> >>>>
> >>>> Thanks for the response. Here's what I am trying to accomplish:
> >>>>
> >>>> 1. Iterate over itemID (unique) in the database using one SQL query.
> >>>> 2. For every itemID found, run 4 searches on the Lucene index.
> >>>> 3. doTagSearch(itemID....) ; collect score
> >>>> 4. doTitleSearch(itemID...) ; collect score
> >>>> 5. doSummarySearch(itemID...) ; collect score
> >>>> 6. doBodySearch(itemID....) ; collect score
> >>>>
> >>>> These scores are then added, and I get a total score for each unique
> >>>> item in the database.
> >>>>
> >>>> The Lucene index has: <itemID><tags><title><summary><contents>
> >>>>
> >>>> So if I am running a body search, I get 92 hits from over 300 documents
> >>>> for a query. I already know which hit I want from its <itemID>.
> >>>>
> >>>> For instance, if itemID 16 from step (1) is passed to all 4 searches, I
> >>>> just need to get the score of the document whose itemID field = 16. I
> >>>> don't have to iterate over all the hits.
> >>>>
> >>>> I suppose I have to change my query to look for <contents> where
> >>>> itemID=16. Can you guide me as to how to do it?
> >>>>
> >>>> thanks a ton,
> >>>>
> >>>> Askar
> >>>>
> >>>> On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED] > wrote:
> >>>>>
> >>>>> Hi Askar,
> >>>>>
> >>>>> I suggest we take a step back and ask the question: what are you
> >>>>> trying to accomplish?  That is, what is your application trying to
> >>>>> do?  Forget the code, etc.; just explain what you want the end result
> >>>>> to be and we can work from there.  Based on what you have described,
> >>>>> I am not sure you need access to the hits.  It seems like you just
> >>>>> need to make better queries.
> >>>>>
> >>>>> Is your itemID a unique identifier?  If yes, then you shouldn't need
> >>>>> to loop over hits at all, as you should only ever have one result IF
> >>>>> your query contains a required term.  Also, if this is the case, why
> >>>>> do you need to do a search at all?  Haven't you already identified
> >>>>> the items of interest when you did your select query in the
> >>>>> database?  Or is it that you want to score the item based on some
> >>>>> terms as well?  If that is the case, there are other ways of doing
> >>>>> this and we can discuss them.
> >>>>>
> >>>>> -Grant
> >>>>>
> >>>>> On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote:
> >>>>>
> >>>>>> Hey Guys,
> >>>>>>
> >>>>>> I need to know how I can use the HitCollector class. I am using Hits
> >>>>>> and looping over all the possible document hits (it turns out I am
> >>>>>> looping 92 times; for 300 searches, that's 300*92!). Can I avoid this
> >>>>>> using HitCollector? I can't seem to understand how it's used.
> >>>>>>
> >>>>>> thanks a lot,
> >>>>>>
> >>>>>> Askar
> >>>>>>
> >>>>>> On 7/25/07, Dmitry <[EMAIL PROTECTED]> wrote:
> >>>>>>>
> >>>>>>> Askar,
> >>>>>>> why do you need to add +id:<idWeCareAbout>?
> >>>>>>> thanks,
> >>>>>>> dt,
> >>>>>>> www.ejinz.com
> >>>>>>> search engine news forms
> >>>>>>> ----- Original Message -----
> >>>>>>> From: "Askar Zaidi" <[EMAIL PROTECTED] >
> >>>>>>> To: <java-user@lucene.apache.org>; <[EMAIL PROTECTED]>
> >>>>>>> Sent: Wednesday, July 25, 2007 12:39 AM
> >>>>>>> Subject: Re: Fine Tuning Lucene implementation
> >>>>>>>
> >>>>>>>
> >>>>>>>> Hey Hira,
> >>>>>>>>
> >>>>>>>> Thanks so much for the reply. Much appreciated.
> >>>>>>>>
> >>>>>>>> Quote:
> >>>>>>>>
> >>>>>>>> Would it be possible to just include a query clause?
> >>>>>>>>   - i.e., instead of just contents:<userQuery>, also add
> >>>>>>>> +id:<idWeCareAbout>
> >>>>>>>>
> >>>>>>>> How can I do that?
> >>>>>>>>
> >>>>>>>> I see my query as:
> >>>>>>>>
> >>>>>>>> +contents:harvard +contents:business +contents:review
> >>>>>>>>
> >>>>>>>> where the search phrase was: harvard business review
> >>>>>>>>
> >>>>>>>> Now how can I add +id:<idWeCareAbout>?
> >>>>>>>>
> >>>>>>>> This would give me the one exact document I am looking for, for that
> >>>>>>>> id. I don't have to iterate through hits.
> >>>>>>>>
> >>>>>>>> thanks,
> >>>>>>>>
> >>>>>>>> Askar
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 7/24/07, N. Hira < [EMAIL PROTECTED]> wrote:
> >>>>>>>>>
> >>>>>>>>> I'm no expert on this (so please accept the comments in that
> >>>>>>>>> context), but 2 things seem weird to me:
> >>>>>>>>>
> >>>>>>>>> 1.  Iterating over each hit is an expensive proposition.  I've often
> >>>>>>>>> seen people recommending a HitCollector.
> >>>>>>>>>
> >>>>>>>>> 2.  It seems that doBodySearch() is essentially saying, do this search
> >>>>>>>>> and return the score pertinent to this ID (using an exhaustive loop).
> >>>>>>>>> Would it be possible to just include a query clause?
> >>>>>>>>>     - i.e., instead of just contents:<userQuery>, also add
> >>>>>>>>> +id:<idWeCareAbout>
> >>>>>>>>>
> >>>>>>>>> In general though, I think your algorithm seems inefficient (if I
> >>>>>>>>> understand it correctly): if I want to search for one term among 3 in
> >>>>>>>>> a "collection" of 300 documents (as defined by some external
> >>>>>>>>> attribute), I will wind up executing 300 x 3 searches, and for each
> >>>>>>>>> search that is executed, I will iterate over every Hit, even if I've
> >>>>>>>>> already found the one that I "care about".
> >>>>>>>>>
> >>>>>>>>> What would break if you:
> >>>>>>>>> 1.  Included "creator" in the Lucene index (or filtered out the Hits
> >>>>>>>>> using a BitSet or something like it)
> >>>>>>>>> 2.  Executed 1 search
> >>>>>>>>> 3.  Collected the results of the first N Hits (where N is some
> >>>>>>>>> reasonable limit, like 100 or 500)
> >>>>>>>>>
> >>>>>>>>> -h
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, 2007-07-24 at 20:14 -0400, Askar Zaidi wrote:
> >>>>>>>>>
> >>>>>>>>>> Sure.
> >>>>>>>>>>
> >>>>>>>>>>  public float doBodySearch(Searcher searcher, String query, int id) {
> >>>>>>>>>>      try {
> >>>>>>>>>>          score = search(searcher, query, id);
> >>>>>>>>>>      }
> >>>>>>>>>>      catch (IOException io) {}
> >>>>>>>>>>      catch (ParseException pe) {}
> >>>>>>>>>>
> >>>>>>>>>>      return score;
> >>>>>>>>>>  }
> >>>>>>>>>>
> >>>>>>>>>>  private float search(Searcher searcher, String queryString, int id)
> >>>>>>>>>>          throws ParseException, IOException {
> >>>>>>>>>>
> >>>>>>>>>>      // Build a Query object
> >>>>>>>>>>      QueryParser queryParser = new QueryParser("contents", new KeywordAnalyzer());
> >>>>>>>>>>      queryParser.setDefaultOperator(QueryParser.Operator.AND);
> >>>>>>>>>>      Query query = queryParser.parse(queryString);
> >>>>>>>>>>
> >>>>>>>>>>      // Search for the query
> >>>>>>>>>>      Hits hits = searcher.search(query);
> >>>>>>>>>>      Document doc = null;
> >>>>>>>>>>
> >>>>>>>>>>      // Examine the Hits object to see if there were any matches
> >>>>>>>>>>      int hitCount = hits.length();
> >>>>>>>>>>
> >>>>>>>>>>      for (int i = 0; i < hitCount; i++) {
> >>>>>>>>>>          doc = hits.doc(i);
> >>>>>>>>>>          String str = doc.get("item");
> >>>>>>>>>>          int tmp = Integer.parseInt(str);
> >>>>>>>>>>          if (tmp == id)
> >>>>>>>>>>              score = hits.score(i);
> >>>>>>>>>>      }
> >>>>>>>>>>
> >>>>>>>>>>      return score;
> >>>>>>>>>>  }
> >>>>>>>>>>
> >>>>>>>>>> I really need to optimize doBodySearch(...) as this takes the most
> >>>>>>>>>> time.
> >>>>>>>>>>
> >>>>>>>>>> thanks guys,
> >>>>>>>>>> Askar
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>         Could you show us the relevant source from
> >>>>>>>>>> doBodySearch()?
> >>>>>>>>>>
> >>>>>>>>>>         -h
> >>>>>>>>>>
> >>>>>>>>>>         On Tue, 2007-07-24 at 19:58 -0400, Askar Zaidi wrote:
> >>>>>>>>>>> I ran some tests and it seems that the slowness is from Lucene calls
> >>>>>>>>>>> when I do "doBodySearch". If I remove that call, Lucene gives me
> >>>>>>>>>>> results in 5 seconds; otherwise it takes about 50 seconds.
> >>>>>>>>>>>
> >>>>>>>>>>> But I need to do a body search, and that field contains lots of text.
> >>>>>>>>>>> The field is <contents>. How can I optimize that?
> >>>>>>>>>>>
> >>>>>>>>>>> thanks,
> >>>>>>>>>>> Askar
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>>>>> For additional commands, e-mail: java-user-[EMAIL PROTECTED]
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>> --------------------------
> >>>>> Grant Ingersoll
> >>>>> Center for Natural Language Processing
> >>>>> http://www.cnlp.org/tech/lucene.asp
> >>>>>
> >>>>> Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ
> >>>>>
> >>>
> >>> ------------------------------------------------------
> >>> Grant Ingersoll
> >>> http://www.grantingersoll.com/
> >>> http://lucene.grantingersoll.com
> >>> http://www.paperoftheweek.com/
> >>>
> >>
>
