Hi Eugene,

thanks for the hint. I looked at the automaton approach, and of course 100k queries 
do not consume a lot of memory, and a single pass through the document is perfect. 
But how will it work with 20 million and more queries?
In addition, there will be a different set of queries for each language, so we 
can expect 20 million+ queries per language. Of course those would be separate 
automata.

Another approach I am thinking of is to build an inverted index of the queries: 
during the map phase I look up the candidate queries for each word, and during 
the reduce phase I do normal query processing. What do you think about that?
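
Roughly, I imagine the map side like this (just a sketch to illustrate the idea; 
the class and field names are made up, and loading the index is omitted):

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (query, document) candidate pairs; the reducer then does the real
// phrase matching for each candidate query.
public class CandidateMapper extends Mapper<LongWritable, Text, Text, Text> {

    // inverted index of the queries: word -> queries containing that word
    // (would be loaded in setup(), e.g. from a side file / the distributed cache)
    private Map<String, List<String>> queriesByWord = new HashMap<String, List<String>>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String doc = value.toString();
        Set<String> candidates = new HashSet<String>();
        for (String word : doc.toLowerCase().split("\\s+")) {
            List<String> qs = queriesByWord.get(word);
            if (qs != null) candidates.addAll(qs);
        }
        for (String q : candidates) {
            context.write(new Text(q), new Text(doc)); // or a document id
        }
    }
}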

regards,
Oliver

On 03.11.2011, at 12:37, Eugene Kirpichov wrote:

> Hi Oliver,
> I have solved a similar problem before, and it seems to me that the best 
> solution is to build an automaton (with whole words, not letters on edges) on 
> the set of queries (basically, a "trie") that will allow you to find all 
> queries matching a document by a single pass over the document.
> For 100k queries, the trie will be very fast and very easy to build, and it 
> will not consume much memory.
> The document scan works as follows; for each word of the document:
> - add the root of the trie to the "active nodes" set
> - for each node in the active set, replace the node with its child 
> corresponding to the scanned word (and drop the node if there is no such child).
> 
> Then whenever you add a leaf node to the active set, you are seeing a match 
> of some query that ends at the current word.
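> 
> A rough sketch of that scan in Java, just to illustrate (names are made up and 
> tokenization is simplified):
> 
> import java.util.*;
> 
> // One node of the word-level trie; edges are labeled with whole words.
> class Node {
>     Map<String, Node> children = new HashMap<String, Node>();
>     String query; // non-null if a complete query phrase ends at this node
> }
> 
> class PhraseTrie {
>     private final Node root = new Node();
> 
>     void add(String query) {
>         Node n = root;
>         for (String w : query.toLowerCase().split("\\s+")) {
>             Node child = n.children.get(w);
>             if (child == null) {
>                 child = new Node();
>                 n.children.put(w, child);
>             }
>             n = child;
>         }
>         n.query = query;
>     }
> 
>     // Single pass over the document; "active" holds the nodes of all phrases
>     // that are partially matched up to the previous word.
>     List<String> scan(String[] words) {
>         List<String> matches = new ArrayList<String>();
>         List<Node> active = new ArrayList<Node>();
>         for (String w : words) {
>             active.add(root); // a new phrase may start at every word
>             List<Node> next = new ArrayList<Node>();
>             for (Node n : active) {
>                 Node child = n.children.get(w.toLowerCase());
>                 if (child == null) continue;
>                 if (child.query != null) matches.add(child.query); // query ends here
>                 next.add(child);
>             }
>             active = next;
>         }
>         return matches;
>     }
> }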
> 
> Please tell if this description needs more detail.
> 
> On 03.11.2011, at 15:20, Oliver Krohne <oliver.kro...@yieldkit.com> wrote:
> 
>> Hi Eugene,
>> 
>> thanks for the quick reply.
>> 
>> Not only the queries are changing, but also the documents:
>> 
>> a) The long-term goal is to be able to process a document or a batch of 
>> documents after a certain event, which is what I mean by "realtime". There will 
>> be different types of documents which are updated differently: one set of 
>> documents will be updated every 1-2 hours, other documents will be updated once 
>> a week. So an event means that the queries have to run against the set of 
>> documents that was updated. So probably I need to run the queries every hour 
>> against one subset, and once a week the other documents need to be searched.
>> 
>> b) The queries change a couple of times a day. At this stage we have around 
>> 100k queries, but they will grow every day. A large share of the queries 
>> remains the same, but I expect that after 3 months 10% of the queries will 
>> change every day: either they are deleted or new ones are added. 
>> 
>> One of the main questions I am still thinking about is what to search in: 
>> search for the queries in the documents, or search for the document terms in 
>> the queries. 
>> 
>> At first I will need exact matching of a phrase query, and in a next step 
>> phrase matching with a given precision/slop.
>> 
>> Thanks for your help,
>> Oliver
>> 
>> 
>> On 03.11.2011, at 11:52, Eugene Kirpichov wrote:
>> 
>>> If you really need to do millions of exact text queries against millions of
>>> documents in realtime, a simple grep is not going to be sufficient for you.
>>> You'll need smarter data structures and algorithms.
>>> 
>>> Please specify how frequently the set of *queries* changes and what you
>>> consider "real time".
>>> 
>>> On Thu, Nov 3, 2011 at 2:46 PM, Oliver Krohne 
>>> <oliver.kro...@yieldkit.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I am evaluating different solutions for massive phrase-query execution. I
>>>> need to execute millions of greps, or more precisely phrase queries consisting
>>>> of 1-4 terms, against millions of documents. I saw the Hadoop grep example,
>>>> but it executes grep with a single regex.
>>>> 
>>>> I also saw the "Side data distribution" / "Distributed Cache" feature
>>>> of Hadoop. So I could pass the queries to the mapper and execute each query
>>>> against the input line. The input line would be the entire text of a document
>>>> (usually 50-500 words).
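>>>> 
>>>> Very roughly, I imagine something like this in the mapper (only a sketch; the
>>>> names are made up and loading the query list is left out):
>>>> 
>>>> import java.io.IOException;
>>>> import java.util.*;
>>>> import org.apache.hadoop.io.LongWritable;
>>>> import org.apache.hadoop.io.Text;
>>>> import org.apache.hadoop.mapreduce.Mapper;
>>>> 
>>>> public class GrepMapper extends Mapper<LongWritable, Text, Text, Text> {
>>>> 
>>>>     // the phrase queries, loaded in setup() from the distributed cache (omitted)
>>>>     private List<String> queries = new ArrayList<String>();
>>>> 
>>>>     @Override
>>>>     protected void map(LongWritable key, Text value, Context context)
>>>>             throws IOException, InterruptedException {
>>>>         String doc = value.toString().toLowerCase();
>>>>         for (String q : queries) {
>>>>             if (doc.contains(q)) {        // brute force: every query per document
>>>>                 context.write(new Text(q), new Text(doc));
>>>>             }
>>>>         }
>>>>     }
>>>> }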
>>>> 
>>>> As I am aiming to have this information almost in realtime, another
>>>> question arises about ad-hoc map/reduce jobs. Is there a limit on running a
>>>> lot of jobs in parallel, let's say if I fired a new job once a new
>>>> document arrives? This job would only process that particular document. Or I
>>>> would batch 100-1000 documents and then fire the job.
>>>> 
>>>> Can anyone advise on an approach for doing this with Hadoop?
>>>> 
>>>> Thanks in advance,
>>>> Oliver
>>>> 
>>> 
>>> 
>>> -- 
>>> Eugene Kirpichov
>>> Principal Engineer, Mirantis Inc. http://www.mirantis.com/
>>> Editor, http://fprog.ru/
>> 

---

Oliver Krohne
Founder & CTO

YieldKit UG (haftungsbeschränkt)
Mittelweg 161
20148 Hamburg

T +49 40 209 349 771
F +49 40 209 349 779
E oliver.kro...@yieldkit.com

http://www.yieldkit.com

_____________________________________
Registered office: Hamburg
Managing directors: Sandra Tiemann, Oliver Krohne
Commercial register: Amtsgericht Hamburg HRB 109104