RE: Reverse search

Melanie Langlois Tue, 27 Mar 2007 23:38:46 -0800

Thanks, that makes sense. Just one more question about the MemoryIndex. In your 
example you said I can call memoryIndex.getReader().terms(), but in fact there 
is no public access to the reader from MemoryIndex...
If this is not possible, I will list the document's terms while I'm indexing.


Mélanie 
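
[Editor's sketch] If the reader really is not reachable, one fallback is to collect the document's unique terms during tokenization and use them to shortlist queries. The plain-Java sketch below is only an illustration under assumptions: DocumentTermCollector, uniqueTerms and candidateQueryIds are hypothetical names, the regex tokenizer stands in for whatever Analyzer feeds the MemoryIndex, and the Map stands in for a termDocs(term) lookup on the query index. None of it is Lucene API.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in, not Lucene API: collects a document's unique terms
// and uses them to shortlist the stored queries worth running.
class DocumentTermCollector {

    // Crude tokenizer for illustration only; a real setup would reuse the
    // Analyzer that feeds the MemoryIndex so both sides see identical terms.
    static Set<String> uniqueTerms(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // termToQueryIds plays the role of termDocs(term) on the query index:
    // it maps an indexed term to positions in the parsed-queries array.
    static Set<Integer> candidateQueryIds(Set<String> docTerms,
                                          Map<String, List<Integer>> termToQueryIds) {
        Set<Integer> ids = new TreeSet<>();
        for (String term : docTerms) {
            List<Integer> hits = termToQueryIds.get(term);
            if (hits != null) {
                ids.addAll(hits);
            }
        }
        return ids;
    }
}
```

Only the queries whose ids come back from candidateQueryIds, plus any "run always" queries, would then need to be executed against the new document.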
  
-----Original Message-----
From: markharw00d [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, March 28, 2007 4:34 PM
To: java-user@lucene.apache.org
Subject: Re: Reverse search

 >>I just want to make sure there is no API either

No, but your code looks like it should do the job. That code can be 
improved by something like [pseudocode]:

Set<Term> terms = new HashSet<Term>();
query.extractTerms(terms);

if (query instanceof PhraseQuery)
{
        // find and index only the rarest term, using an existing index
        // (corpusReader) to look up document frequencies
        int rarestDf = Integer.MAX_VALUE;
        Term rarestTerm = null;
        for (Term term : terms) {
                int df = corpusReader.docFreq(term);
                if (df < rarestDf)
                {
                        rarestDf = df;
                        rarestTerm = term;
                }
        }
        // add just the rarest term
        doc.add(new Field(rarestTerm.field(), rarestTerm.text(),
                Field.Store.NO, Field.Index.TOKENIZED));
}
else
{
    // add all terms
}
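
[Editor's sketch] The same rarest-term idea can be tried without a Lucene index at hand. The plain-Java version below is a sketch under assumptions, not Lucene API: QueryIndexBuilder, rarestTerm and addQuery are hypothetical names, a Map of term to document frequency stands in for corpusReader.docFreq(term), and a Map of term to query ids stands in for the RAM index of queries.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not Lucene API: builds a term -> query-id lookup,
// indexing only the rarest term for queries whose terms are all mandatory.
class QueryIndexBuilder {

    // Returns the term with the lowest document frequency; terms missing
    // from the corpus statistics are treated as frequency 0 (rarest).
    static String rarestTerm(List<String> terms, Map<String, Integer> docFreq) {
        String rarest = null;
        int rarestDf = Integer.MAX_VALUE;
        for (String term : terms) {
            int df = docFreq.getOrDefault(term, 0);
            if (df < rarestDf) {
                rarestDf = df;
                rarest = term;
            }
        }
        return rarest;
    }

    // Registers one query under queryId. allTermsMandatory would be true
    // for phrase-like queries, false when any single term may match.
    static void addQuery(Map<String, List<Integer>> index, int queryId,
                         List<String> terms, boolean allTermsMandatory,
                         Map<String, Integer> docFreq) {
        List<String> toIndex = terms;
        if (allTermsMandatory && !terms.isEmpty()) {
            toIndex = Collections.singletonList(rarestTerm(terms, docFreq));
        }
        for (String term : toIndex) {
            index.computeIfAbsent(term, k -> new ArrayList<>()).add(queryId);
        }
    }
}
```

With document frequencies of 3 for "xyz" and 5000 for "limited", a phrase query on "xyz limited" is registered under "xyz" only, so a document containing just "limited" never triggers it.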

Melanie Langlois wrote:
> Mark,
> When I extract the terms from my query, can I not add them directly? Do I 
> have to do something like:
> Set<Term> terms=new HashSet<Term>();
> query.extractTerms(terms);
> Document doc=new Document();
> for(Term term:terms){
> doc.add(new 
> Field(term.field(),term.text(),Field.Store.NO,Field.Index.TOKENIZED));
> }
>
> I just want to make sure there is no API either to create the document from 
> terms or to index the terms directly.
>
> Thanks
> Mélanie 
>
> -----Original Message-----
> From: markharw00d [mailto:[EMAIL PROTECTED] 
> Sent: Monday, March 26, 2007 12:36 AM
> To: java-user@lucene.apache.org
> Subject: Re: Reverse search
>
>
> On app startup:
> 1) parse all Queries and place in an array.
> 2) Create a RAMIndex containing a doc for each query with content 
> consisting of the query's terms (see Query.extractTerms). For optimal 
> performance only index the most rare term for queries with multiple 
> mandatory criteria e.g. PhraseQuerys. "Most rare" can be determined by 
> looking at IndexReader.docFreq(t) using an existing index which is 
> representative of your type of content.
> 3) For any queries that can't be handled by 2) e.g. FuzzyQueries - add 
> to list of "run always queries".
>
> Whenever you receive a new document:
> 1) Put it in a MemoryIndex
> 2) Get a list of the document's terms by calling 
> memoryIndex.getReader().terms();
> 3) For each term hit your query RAMIndex and get 
> queryIndexReader.termDocs(term) - this will give you the ids of queries 
> that need to be run - you can use the doc id to index straight into your 
> parsed queries array.
> 4) Run all queries found in 3) and all those held in your "run always" 
> list against the MemoryIndex containing your new document
>
> Hope this helps,
> Mark
>
>
> Melanie Langlois wrote:
>   
>> Hi Mark,
>> If I follow you, I should list the key terms in my incoming document, then 
>> select the queries which contain these key terms, and then run those 
>> queries on my index? If this is correct, there are two things I don't 
>> understand:
>> - how do I know which term is a key term in my document?
>> - how can I select the queries? Should I index them in a separate index?
>>
>> Thanks,
>>
>>
>> Mélanie Langlois 
>>   
>> -----Original Message-----
>> From: mark harwood [mailto:[EMAIL PROTECTED] 
>> Sent: Friday, March 23, 2007 11:19 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Reverse search
>>
>> Bear in mind that the million queries you run on the MemoryIndex can be 
>> shortlisted if you place those queries in a RAMIndex and use the source 
>> document's terms to "query the queries". The list of unique terms for your 
>> document is readily available in the MemoryIndex's TermEnum.
>> You can take this list and find "likely related queries" to execute from 
>> your Query index.
>> Note that for phrase queries or other forms of query with multiple mandatory 
>> terms you should only index one of the terms (preferably the rarest) to 
>> ensure that your query is not needlessly executed. For example - using this 
>> approach I need only run the phrase query for "XYZ limited" whenever I 
>> encounter a document with the rare term "XYZ" in it, rather than the much 
>> more commonplace "limited". 
>>
>> Cheers
>> Mark
>>
>> ----- Original Message ----
>> From: karl wettin <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Friday, 23 March, 2007 12:54:36 PM
>> Subject: Re: Reverse search
>>
>>
>> 23 mar 2007 kl. 09.57 skrev Melanie Langlois:
>>
>>   
>>     
>>> Well, I thought of using the PerFieldAnalyzerWrapper, with the  
>>> SnowballAnalyzer with English stopwords as the default, and a  
>>> SnowballAnalyzer with language-specific keywords for the fields  
>>> which will be in different languages. But I see that in your  
>>> MemoryIndexTest you commented out the use of SnowballAnalyzer. Is it  
>>> because it's too slow? In that case, I think I could use the  
>>> StandardAnalyzer... what do you think?
>>>     
>>>       
>> I think that creating an index with a couple of documents takes a  
>> fraction of the time it will take to place a million queries on that  
>> index. There is no real need to optimize something that takes  
>> milliseconds when you in the same process do something that takes  
>> half a minute.
>>
>>   
>>     
>
>
>
>       
>       
>               
>
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
