Thanks, make sense. Just another question about the memoryIndex. In your example you said I can do memoryIndex. getReader().terms(); but in fact there is no public access to the reader from memory index... If this is not possible, I will list the docs terms while I'm indexing.
Mélanie -----Original Message----- From: markharw00d [mailto:[EMAIL PROTECTED] Sent: Wednesday, March 28, 2007 4:34 PM To: java-user@lucene.apache.org Subject: Re: Reverse search >>I just want to make sure there is no API either No, but your code looks like it should do the job. That code can be improved by something like [psuedo code]: query.extractTerms(terms); if(query instanceof PhraseQuery) { //find and index rarest term only using an existing index (corpusReader) int rarestDf=Integer.MAX; Term rarestTerm=null; for(Term term:terms){ int df= corpusReader.docFreq(term); if(df<rarestDf) { rarestDf=df; rarestTerm=term; } } //add just the rarest term doc.add(new Field(rarestTerm.field(),rarestTerm.text(), Field.Store.NO,Field.Index.TOKENIZED); } else { //add all terms } Melanie Langlois wrote: > Mark, > When, I extract the terms from my query, I can not use add them directly? I > have to do something like: > Set<Term> terms=new HashSet<Term>(); > query.extractTerms(terms); > Document doc=new Document(); > for(Term term:terms){ > doc.add(new > Field(term.field(),term.text(),Field.Store.NO,Field.Index.TOKENIZED); > } > > I just want to make sure there is no API either create the document from term > or to index the term directly. > > Thanks > Mélanie > > -----Original Message----- > From: markharw00d [mailto:[EMAIL PROTECTED] > Sent: Monday, March 26, 2007 12:36 AM > To: java-user@lucene.apache.org > Subject: Re: Reverse search > > > On app startup: > 1) parse all Queries and place in an array. > 2) Create a RAMIndex containing a doc for each query with content > consisting of the query's terms (see Query.extractTerms). For optimal > performance only index the most rare term for queries with multiple > mandatory criteria e.g. PhraseQuerys. "Most rare" can be determined by > looking at IndexReader.docFreq(t) using an existing index which is > representative of your type of content. > 3) For any queries that can't be handled by 2) e.g. FuzzyQueries - add > to list of "run always queries". > > Whenever you receive a new document: > 1) Put it in a MemoryIndex > 2) Get a list of the document's terms by calling > memoryIndex.getReader().terms(); > 3) For each term hit your query RAMIndex and get > queryIndexReader.termDocs(term) - this will give you the ids of queries > that need to be run - you can use the doc id to index straight into your > parsed queries array. > 4) Run all queries found in 3) and all those held in your "run always" > list against the MemoryIndex containing your new document > > Hope this helps, > Mark > > > Melanie Langlois wrote: > >> Hi Mark, >> If I follow you, I should list the key terms in my incoming document, then >> select the queries which contains these key terms, and then run those >> queries on my index ? If this is correct there is two things I don't >> understand: >> -how do I know which term is a key term in my document ? >> -how can I select the queries? Should I index them in a separate index? >> >> Thanks, >> >> >> Mélanie Langlois >> >> -----Original Message----- >> From: mark harwood [mailto:[EMAIL PROTECTED] >> Sent: Friday, March 23, 2007 11:19 PM >> To: java-user@lucene.apache.org >> Subject: Re: Reverse search >> >> Bear in mind that the million queries you run on the MemoryIndex can be >> shortlisted if you place those queries in a RAMIndex and use the source >> document's terms to "query the queries". The list of unique terms for your >> document is readily available in the MemoryIndex's TermEnum. >> You can take this list and find "likely related queries" to execute from >> your Query index. >> Note that for phrase queries or other forms of query with multiple mandatory >> terms you should only index one of the terms (preferably the rarest) to >> ensure that your query is not needlessly executed. For example - using this >> approach I need only run the phrase query for "XYZ limited" whenever I >> encounter a document with the rare term "XYZ" in it, rather than the much >> more commonplace "limited". >> >> Cheers >> Mark >> >> ----- Original Message ---- >> From: karl wettin <[EMAIL PROTECTED]> >> To: java-user@lucene.apache.org >> Sent: Friday, 23 March, 2007 12:54:36 PM >> Subject: Re: Reverse search >> >> >> 23 mar 2007 kl. 09.57 skrev Melanie Langlois: >> >> >> >>> Well, I though to use the PerFieldAnalyzerWrapper which contains as >>> basic the snowballAnalyzer with English stopwords and use >>> snowballAnalyzer with language specific keywords for the fields >>> which will be in different languages. But I'm seeing that in your >>> MemoryIndexTest you commented the use of SnowballAnalyzer, is it >>> because it's too slow. In this case, I think I could use the >>> StandardAnalyzer... what do you think? >>> >>> >> I think that creating an index with a couple of documents takes a >> fraction of the time it will take to place a million queries on that >> index. There is no real need to optimize something that takes >> milliseconds when you in the same process do something that takes >> half a minute. >> >> >> > > > > > > > ___________________________________________________________ > All new Yahoo! Mail "The new Interface is stunning in its simplicity and ease > of use." - PC Magazine > http://uk.docs.yahoo.com/nowyoucan.html > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]