RE: Reverse search

Melanie Langlois Sat, 24 Mar 2007 23:34:34 -0800

Hi Mark,
If I follow you, I should list the key terms in my incoming document, then 
select the queries which contains these key terms, and then run those queries 
on my index ? If this is correct there is two things I don't understand:
-how do I know which term is a key term in my document ?
-how can I select the queries? Should I index them in a separate index?


Thanks,


Mélanie Langlois 
  
-----Original Message-----
From: mark harwood [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 23, 2007 11:19 PM
To: [email protected]
Subject: Re: Reverse search

Bear in mind that the million queries you run on the MemoryIndex can be 
shortlisted if you place those queries in a RAMIndex and use the source 
document's terms to "query the queries". The list of unique terms for your 
document is readily available in the MemoryIndex's TermEnum.
You can take this list and find "likely related queries" to execute from your 
Query index.
Note that for phrase queries or other forms of query with multiple mandatory 
terms you should only index one of the terms (preferably the rarest) to ensure 
that your query is not needlessly executed. For example - using this approach I 
need only run the phrase query for "XYZ limited" whenever I encounter a 
document with the rare term "XYZ" in it, rather than the much more commonplace 
"limited". 

Cheers
Mark

----- Original Message ----
From: karl wettin <[EMAIL PROTECTED]>
To: [email protected]
Sent: Friday, 23 March, 2007 12:54:36 PM
Subject: Re: Reverse search


23 mar 2007 kl. 09.57 skrev Melanie Langlois:

> Well, I though to use the PerFieldAnalyzerWrapper which contains as  
> basic the snowballAnalyzer with English stopwords and use  
> snowballAnalyzer with language specific keywords for the fields  
> which will be in different languages. But I'm seeing that in your  
> MemoryIndexTest you commented the use of SnowballAnalyzer, is it  
> because it's too slow. In this case, I think I could use the  
> StandardAnalyzer... what do you think?

I think that creating an index with a couple of documents takes a  
fraction of the time it will take to place a million queries on that  
index. There is no real need to optimize something that takes  
milliseconds when you in the same process do something that takes  
half a minute.

-- 
karl

>
> Mélanie
>
> -----Original Message-----
> From: karl wettin [mailto:[EMAIL PROTECTED]
> Sent: Friday, March 23, 2007 12:46 PM
> To: [email protected]
> Subject: Re: Reverse search
>
>
> 23 mar 2007 kl. 03.07 skrev Melanie Langlois:
>
>> Thanks Karl, the performances graph is really amazing!
>> I have to say that it would not have think this way around would be
>> faster, but sounds nice if I can use this, make everything easier
>> to manage. I'm just wondering what did you consider when you build
>> your graph, only the time to run the queries? Because, I should add
>> the time for creating the index anytime a new document comes in (or
>> a subset of documents if several comes in same time), and the
>> indexing of these documents. The documents should not be big,
>> around 2KB. Did you measure this part ?
>
> Adding a document to a MemoryIndex or InstantiatedIndex takes more or
> less the same time it would take to add it to an empty RAMDirectory.
> How many clock ticks is spent really depends on what analysers you  
> use.
>
> -- 
> karl
>
>>
>> Mélanie
>>
>> -----Original Message-----
>> From: karl wettin [mailto:[EMAIL PROTECTED]
>> Sent: Friday, March 23, 2007 10:35 AM
>> To: [email protected]
>> Subject: Re: Reverse search
>>
>>
>> 23 mar 2007 kl. 02.12 skrev Melanie Langlois:
>>
>>> I want to manage user subscriptions to specific documents. So I
>>> would like to store the subscription (query) into the lucene
>>> directory, and whenever I receive a new document, I will search all
>>> the matching subscriptions to send the documents to all subcribers.
>>> For instance if a user subscribes to all documents with text
>>> containing (WORD1 and WORD2) or WORD3, how can I match the incoming
>>> document based on stored subscriptions? I was thinking to have two
>>> subfields for each field of the subscription: the AND conditions
>>> and the OR conditions.
>>>
>>> -OR. I will tokenized the document field content and insert OR
>>> between each of them, and run the query against OR condition of
>>> subscription
>>>
>>> -It's for the AND that I will have an issue, because if the
>>> incoming text may contains more words than the sequence I want to
>>> search.
>>>
>>> For instance, if I subscribe for documents contents lucene and java
>>> for instance , if the incoming document contents is lucene is a
>>> great API which has been developed in java, once I removed
>>> stopwords my query would look like lucene and great and API and
>>> developed and java.
>>>
>>> As query is composed of more words than the stored subscription I
>>> will fail to retrieve the subscription. But if I put only or words,
>>> the results will not be accurate, as I can obtain subscription only
>>> for java for instance.
>>>
>>
>> I wrote such a thing way back, where I used the new document as the
>> query and the user subscriptions as the index. Similar to what you
>> describe, I had an AND, OR and NOT field. This really limited the
>> type of queries users could store. It does however work, particullary
>> well on systems with /huge/ amounts of subscriptions (many millions).
>>
>> Today I would have used something else. If you insert one document at
>> the time to your index, take a look at MemoryIndex in contrib. If you
>> insert documents in batches larger than one document at the time,
>> take a look at LUCENE-550 in the Jira. Add new documents to such an
>> index and place the subscribed queries on it. Depening on the
>> queries, the speed should be some 20-100 times faster than using a
>> RAMDirectory. One million queries should take some 20 seconds to
>> assemble and place on a 25 document index on my laptop. See <https://
>> issues.apache.org/jira/secure/attachment/
>> 12353601/12353601_HitCollectionBench.jpg> for performace of
>> LUCENE-550.
>>
>> -- 
>> karl
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






                
___________________________________________________________ 
What kind of emailer are you? Find out today - get a free analysis of your 
email personality. Take the quiz at the Yahoo! Mail Championship. 
http://uk.rd.yahoo.com/evt=44106/*http://mail.yahoo.net/uk 

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: Reverse search

Reply via email to