Hey guys,

One last question, and I think I'll have an optimized algorithm.

How can I build a query in my program? This is what I am doing now:

    QueryParser queryParser = new QueryParser("contents", new StandardAnalyzer());
    queryParser.setDefaultOperator(QueryParser.Operator.AND);
    Query q = queryParser.parse(queryString);

Calling System.out.println(q) then shows:

    +contents:harvard +contents:business +contents:review

I'd like to modify Query q to read:

    +contents:harvard +contents:business +contents:review +itemID:<id passed to the search method>

That query would pick the one document I need from the index and give me its score, so I would not have to iterate over Hits.

Any clues? I can't find any examples of query building.

Thanks!
Askar

On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> Yes, you can do that.
>
> On Jul 25, 2007, at 12:31 PM, Askar Zaidi wrote:
>
> > Heres what I mean:
> >
> > http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields
> >
> > title:"The Right Way" AND text:go
> >
> > Although, I am not searching for the title "the right way" , I am looking
> > for the score by specifying a unique field (itemID).
> >
> > when I do System.out.println(query);
> >
> > I get:
> >
> > +contents:Harvard +contents:Business +contents:Review
> >
> > Can I just add:
> >
> > +contents:Harvard +contents:Business +contents:Review +itemID=id ??
> >
> > That query would just return one document.
> >
> > On 7/25/07, Askar Zaidi <[EMAIL PROTECTED]> wrote:
> >>
> >> Instead of refactoring the code, would there be a way to just modify the
> >> query in each search routine ?
> >>
> >> Such as, "search contents:<text> and item:<itemID>"; This means it would
> >> just collect the score of that one document whose itemID field = itemID
> >> passed from while( rs.next()).
> >>
> >> I just need to collect the score of the <itemID> already in the index.
> >>
> >> Would there be a way to modify the query ? Add a clause ?
> >>
> >> thanks,
> >> Askar
> >>
> >> On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >>>
> >>> So, you really want a single Lucene score (based on the scores of
> >>> your 4 fields) for every itemID, correct? And this score consists of
> >>> scoring the title, tag, summary and body against some keywords
> >>> correct?
> >>>
> >>> Here's what I would do:
> >>>
> >>> while (rs.next())
> >>> {
> >>>     doc = getDocument(itemId);
> >>>     // Get your document, including contents from your database, no
> >>>     // need even to put them in Lucene, although you could
> >>>     // add the doc to a MemoryIndex (see contrib/memory)
> >>>     // Run your 4 searches against that memory index to get your
> >>>     // score. Even better, combine your query into a single query that
> >>>     // searches all 4 fields at once, then Lucene will combine the
> >>>     // score for you
> >>> }
> >>>
> >>> MemoryIndex info can be found at
> >>> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/memory/package-summary.html
> >>>
> >>> -Grant
> >>>
> >>> On Jul 25, 2007, at 11:45 AM, Askar Zaidi wrote:
> >>>
> >>>> Hi Grant,
> >>>>
> >>>> Thanks for the response. Heres what I am trying to accomplish:
> >>>>
> >>>> 1. Iterate over itemID (unique) in the database using one SQL query.
> >>>> 2. For every itemID found, run 4 searches on Lucene Index.
> >>>> 3. doTagSearch(itemID....) ; collect score
> >>>> 4. doTitleSearch(itemID...) ; collect score
> >>>> 5. doSummarySearch(itemID...) ; collect score
> >>>> 6. doBodySearch(itemID....) ; collect score
> >>>>
> >>>> These scores are then added and I get a total score for each unique
> >>>> item in the database.
> >>>>
> >>>> Lucene Index has: <itemID><tags><title><summary><contents>
> >>>>
> >>>> So if I am running a body search, I have 92 hits from over 300
> >>>> documents for a query. I already know my hit with the <itemID> .
> >>>>
> >>>> For instance, from step (1) if itemID 16 is passed to all the 4
> >>>> searches, I just need to get the score of the document which has
> >>>> itemID field = 16. I don't have to iterate over all the hits.
> >>>>
> >>>> I suppose I have to change my query to look for <contents> where
> >>>> itemID=16.
> >>>> Can you guide me as to how to do it ?
> >>>>
> >>>> thanks a ton,
> >>>>
> >>>> Askar
> >>>>
> >>>> On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>> Hi Askar,
> >>>>>
> >>>>> I suggest we take a step back, and ask the question, what are you
> >>>>> trying to accomplish? That is, what is your application trying to
> >>>>> do? Forget the code, etc. just explain what you want the end result
> >>>>> to be and we can work from there. Based on what you have described,
> >>>>> I am not sure you need access to the hits. It seems like you just
> >>>>> need to make better queries.
> >>>>>
> >>>>> Is your itemID a unique identifier? If yes, then you shouldn't need
> >>>>> to loop over hits at all, as you should only ever have one result IF
> >>>>> your query contains a required term. Also, if this is the case, why
> >>>>> do you need to do a search at all? Haven't you already identified
> >>>>> the items of interest when you did your select query in the
> >>>>> database? Or is it that you want to score the item based on some
> >>>>> terms as well. If that is the case, there are other ways of doing
> >>>>> this and we can discuss them.
> >>>>>
> >>>>> -Grant
> >>>>>
> >>>>> On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote:
> >>>>>
> >>>>>> Hey Guys,
> >>>>>>
> >>>>>> I need to know how I can use the HitCollector class ? I am using
> >>>>>> Hits and looping over all the possible document hits (turns out its
> >>>>>> 92 times I am looping; for 300 searches, its 300*92 !!). Can I avoid
> >>>>>> this using HitCollector ? I can't seem to understand how its used.
> >>>>>>
> >>>>>> thanks a lot,
> >>>>>>
> >>>>>> Askar
> >>>>>>
> >>>>>> On 7/25/07, Dmitry <[EMAIL PROTECTED]> wrote:
> >>>>>>>
> >>>>>>> Askar,
> >>>>>>> why do you need to add +id:<idWeCareAbout>?
> >>>>>>> thanks,
> >>>>>>> dt,
> >>>>>>> www.ejinz.com
> >>>>>>> search engine news forms
> >>>>>>> ----- Original Message -----
> >>>>>>> From: "Askar Zaidi" <[EMAIL PROTECTED]>
> >>>>>>> To: <java-user@lucene.apache.org>; <[EMAIL PROTECTED]>
> >>>>>>> Sent: Wednesday, July 25, 2007 12:39 AM
> >>>>>>> Subject: Re: Fine Tuning Lucene implementation
> >>>>>>>
> >>>>>>>> Hey Hira ,
> >>>>>>>>
> >>>>>>>> Thanks so much for the reply. Much appreciate it.
> >>>>>>>>
> >>>>>>>> Quote:
> >>>>>>>>
> >>>>>>>> Would it be possible to just include a query clause?
> >>>>>>>> - i.e., instead of just contents:<userQuery>, also add
> >>>>>>>> +id:<idWeCareAbout>
> >>>>>>>>
> >>>>>>>> How can I do that ?
> >>>>>>>>
> >>>>>>>> I see my query as :
> >>>>>>>>
> >>>>>>>> +contents:harvard +contents:business +contents:review
> >>>>>>>>
> >>>>>>>> where the search phrase was: harvard business review
> >>>>>>>>
> >>>>>>>> Now how can I add +id:<idWeCareAbout> ??
> >>>>>>>>
> >>>>>>>> This would give me that one exact document I am looking for , for
> >>>>>>>> that id. I don't have to iterate through hits.
> >>>>>>>>
> >>>>>>>> thanks,
> >>>>>>>>
> >>>>>>>> Askar
> >>>>>>>>
> >>>>>>>> On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote:
> >>>>>>>>>
> >>>>>>>>> I'm no expert on this (so please accept the comments in that
> >>>>>>>>> context) but 2 things seem weird to me:
> >>>>>>>>>
> >>>>>>>>> 1. Iterating over each hit is an expensive proposition. I've often
> >>>>>>>>> seen people recommending a HitCollector.
> >>>>>>>>>
> >>>>>>>>> 2. It seems that doBodySearch() is essentially saying, do this
> >>>>>>>>> search and return the score pertinent to this ID (using an
> >>>>>>>>> exhaustive loop).
> >>>>>>>>> Would it be possible to just include a query clause?
> >>>>>>>>> - i.e., instead of just contents:<userQuery>, also add
> >>>>>>>>> +id:<idWeCareAbout>
> >>>>>>>>>
> >>>>>>>>> In general though, I think your algorithm seems inefficient (if I
> >>>>>>>>> understand it correctly):-- if I want to search for one term
> >>>>>>>>> among 3 in a "collection" of 300 documents (as defined by some
> >>>>>>>>> external attribute), I will wind up executing 300 x 3 searches,
> >>>>>>>>> and for each search that is executed, I will iterate over every
> >>>>>>>>> Hit, even if I've already found the one that I "care about".
> >>>>>>>>>
> >>>>>>>>> What would break if you:
> >>>>>>>>> 1. Included "creator" in the Lucene index (or, filtered out the
> >>>>>>>>> Hits using a BitSet or something like it)
> >>>>>>>>> 2. Executed 1 search
> >>>>>>>>> 3. Collected the results of the first N Hits (where N is some
> >>>>>>>>> reasonable limit, like 100 or 500)
> >>>>>>>>>
> >>>>>>>>> -h
> >>>>>>>>>
> >>>>>>>>> On Tue, 2007-07-24 at 20:14 -0400, Askar Zaidi wrote:
> >>>>>>>>>
> >>>>>>>>>> Sure.
> >>>>>>>>>>
> >>>>>>>>>> public float doBodySearch(Searcher searcher, String query, int id) {
> >>>>>>>>>>
> >>>>>>>>>>     try {
> >>>>>>>>>>         score = search(searcher, query, id);
> >>>>>>>>>>     }
> >>>>>>>>>>     catch (IOException io) {}
> >>>>>>>>>>     catch (ParseException pe) {}
> >>>>>>>>>>
> >>>>>>>>>>     return score;
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> private float search(Searcher searcher, String queryString, int id)
> >>>>>>>>>>         throws ParseException, IOException {
> >>>>>>>>>>
> >>>>>>>>>>     // Build a Query object
> >>>>>>>>>>     QueryParser queryParser = new QueryParser("contents", new KeywordAnalyzer());
> >>>>>>>>>>     queryParser.setDefaultOperator(QueryParser.Operator.AND);
> >>>>>>>>>>     Query query = queryParser.parse(queryString);
> >>>>>>>>>>
> >>>>>>>>>>     // Search for the query
> >>>>>>>>>>     Hits hits = searcher.search(query);
> >>>>>>>>>>     Document doc = null;
> >>>>>>>>>>
> >>>>>>>>>>     // Examine the Hits object to see if there were any matches
> >>>>>>>>>>     int hitCount = hits.length();
> >>>>>>>>>>
> >>>>>>>>>>     for (int i = 0; i < hitCount; i++) {
> >>>>>>>>>>         doc = hits.doc(i);
> >>>>>>>>>>         String str = doc.get("item");
> >>>>>>>>>>         int tmp = Integer.parseInt(str);
> >>>>>>>>>>         if (tmp == id)
> >>>>>>>>>>             score = hits.score(i);
> >>>>>>>>>>     }
> >>>>>>>>>>
> >>>>>>>>>>     return score;
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> I really need to optimize doBodySearch(...) as this takes the most
> >>>>>>>>>> time.
> >>>>>>>>>>
> >>>>>>>>>> thanks guys,
> >>>>>>>>>> Askar
> >>>>>>>>>>
> >>>>>>>>>> On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Could you show us the relevant source from doBodySearch()?
> >>>>>>>>>>
> >>>>>>>>>> -h
> >>>>>>>>>>
> >>>>>>>>>> On Tue, 2007-07-24 at 19:58 -0400, Askar Zaidi wrote:
> >>>>>>>>>>> I ran some tests and it seems that the slowness is from Lucene
> >>>>>>>>>>> calls when I do "doBodySearch", if I remove that call, Lucene
> >>>>>>>>>>> gives me results in 5 seconds. otherwise it takes about 50
> >>>>>>>>>>> seconds.
> >>>>>>>>>>>
> >>>>>>>>>>> But I need to do Body search and that field contains lots of
> >>>>>>>>>>> text. The field is <contents>. How can I optimize that ?
> >>>>>>>>>>>
> >>>>>>>>>>> thanks,
> >>>>>>>>>>> Askar
> >>>>>>>
> >>>>>>> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>>>>> For additional commands, e-mail: java-user-[EMAIL PROTECTED]
> >>>>>
> >>>>> --------------------------
> >>>>> Grant Ingersoll
> >>>>> Center for Natural Language Processing
> >>>>> http://www.cnlp.org/tech/lucene.asp
> >>>>>
> >>>>> Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ
> >>>
> >>> ------------------------------------------------------
> >>> Grant Ingersoll
> >>> http://www.grantingersoll.com/
> >>> http://lucene.grantingersoll.com
> >>> http://www.paperoftheweek.com/
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
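
For the question at the top of this message, one possible approach is to add the required ID clause to the query text before it reaches the parser. The sketch below only builds that string. The class and method names (ItemQueryStrings, withRequiredItem) are made up for illustration and are not part of Lucene. It assumes the ID was indexed as a single term in a field called itemID, that the analyzer passes that term through unchanged, and that the user's text contains no unescaped query-syntax characters.

```java
/**
 * Hypothetical helper (not part of Lucene): builds query text that requires
 * both the user's terms and one specific item ID.
 */
final class ItemQueryStrings {

    private ItemQueryStrings() {
        // static helpers only
    }

    /**
     * Wraps the user's query in a required group and appends a required
     * clause on the ID field.
     */
    static String withRequiredItem(String userQuery, String idField, int itemId) {
        if (userQuery == null || userQuery.trim().isEmpty()) {
            throw new IllegalArgumentException("userQuery must not be empty");
        }
        // "+(...)" makes the whole user query mandatory; "+field:value" makes the ID mandatory.
        return "+(" + userQuery.trim() + ") +" + idField + ":" + itemId;
    }

    public static void main(String[] args) {
        String combined = withRequiredItem("harvard business review", "itemID", 16);
        System.out.println(combined); // prints "+(harvard business review) +itemID:16"
        // Assumed next step, using the parser already set up in this thread:
        //   Query q = queryParser.parse(combined);
    }
}
```

The same result should be reachable without string handling by adding the parsed Query and a TermQuery on the itemID field to a BooleanQuery, each with BooleanClause.Occur.MUST. Either way, the ID clause takes part in scoring too, so the score is unlikely to be identical to the one from the contents-only query.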