Hi,

I'm evaluating different solutions for massive phrase query execution. I
need to execute millions of greps, or more precisely phrase queries
consisting of 1-4 terms, against millions of documents. I saw the Hadoop
grep example, but it executes grep with a single regex.

I also saw the "Side data distribution" / "Distributed Cache" facility of
Hadoop. I could use it to pass the queries to the mapper and execute each
query against the input line. The input line would be the entire text of
a document (usually 50-500 words).
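
To make this concrete, here is a rough sketch of the mapper I have in
mind (the class name, the queries.txt cache file and the plain substring
match are placeholders for illustration, not a finished implementation):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (query, docId) for every phrase query that matches a document.
public class PhraseGrepMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private final List<String> queries = new ArrayList<String>();

  @Override
  protected void setup(Context context) throws IOException {
    // queries.txt was shipped to every node via the DistributedCache
    // (added in the driver with DistributedCache.addCacheFile(...)).
    Path[] cached =
        DistributedCache.getLocalCacheFiles(context.getConfiguration());
    BufferedReader in =
        new BufferedReader(new FileReader(cached[0].toString()));
    String line;
    while ((line = in.readLine()) != null) {
      queries.add(line.trim());
    }
    in.close();
  }

  @Override
  protected void map(LongWritable docId, Text doc, Context context)
      throws IOException, InterruptedException {
    // One input line = the full text of one document (50-500 words).
    String text = doc.toString();
    for (String query : queries) {
      if (text.contains(query)) {  // naive match; tokenize for real use
        context.write(new Text(query), docId);
      }
    }
  }
}

For millions of queries a linear scan per document is obviously too slow,
so I would probably replace the loop with a multi-pattern matcher such as
Aho-Corasick, but the data flow would stay the same.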

As I am aiming to have this information almost in real time, another
question arises about ad-hoc map/reduce jobs. Is there a limit on running
many jobs in parallel, let's say if I fired a new job whenever a new
document arrives? That job would process only that particular document.
Alternatively, I could batch 100-1000 documents and then fire the job.
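
For the batching variant, the driver I am picturing would be roughly this
(the paths and the runBatch method are made up; it just submits one
map-only job per batch directory, with PhraseGrepMapper as above):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PhraseGrepDriver {
  // Submits one job over a directory holding the current batch of docs.
  public static void runBatch(String batchDir, String outDir)
      throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "phrase-grep " + batchDir);
    job.setJarByClass(PhraseGrepDriver.class);
    job.setMapperClass(PhraseGrepMapper.class);
    job.setNumReduceTasks(0);  // map-only: matches are written directly
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    // Ship the query list to every node.
    DistributedCache.addCacheFile(new URI("/queries/queries.txt"),
        job.getConfiguration());
    FileInputFormat.addInputPath(job, new Path(batchDir));
    FileOutputFormat.setOutputPath(job, new Path(outDir));
    job.waitForCompletion(true);  // blocks; job.submit() to fire-and-forget
  }
}

My worry is the per-job startup overhead if I fire one of these per
document instead of per batch.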

Can anyone advise on an approach to doing this with Hadoop?

Thanks in advance,
Oliver
