Hi Sachin,

Just a suggestion here - you could use Apache Kafka to produce and consume
events mapped to incoming crawl requests, crawl status, and more.
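To make the idea concrete, here is a minimal sketch in Python using the
kafka-python client. The topic name ("crawl-requests") and the event fields
are my own assumptions for illustration - they are not part of Nutch or the
prototype below, so adapt them to your setup.

```python
import json

def make_crawl_event(event_type, url, status="queued"):
    """Build a JSON-encoded crawl event (hypothetical schema)."""
    return json.dumps({"type": event_type, "url": url, "status": status})

def publish_crawl_request(producer, url, topic="crawl-requests"):
    """Publish one crawl-request event.

    `producer` is a kafka-python KafkaProducer, e.g.:
        from kafka import KafkaProducer
        producer = KafkaProducer(bootstrap_servers="localhost:9092")
    Requires a running Kafka broker and `pip install kafka-python`.
    """
    event = make_crawl_event("request", url)
    producer.send(topic, event.encode("utf-8"))
```

A consumer on the Nutch side would then read from the same topic and kick
off a crawl cycle for each event it receives.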

I have created a prototype of a production crawl queue [0] which runs on a
supercomputer (TACC Wrangler) and is integrated with Kafka. Please have a
look and let me know if you have any questions.

[0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler
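On the REST API route you mentioned: Nutch 1.x ships a REST server you can
run alongside the crawler. A rough sketch below - the port is the common
default, and the exact job-payload field names (e.g. the seed-directory
argument) may differ, so please verify against the Nutch REST API docs:

```shell
# Start the Nutch REST server (from the Nutch runtime directory)
bin/nutch startserver -port 8081

# Submit an inject job; the "args" keys here are illustrative
curl -X POST http://localhost:8081/job/create \
  -H 'Content-Type: application/json' \
  -d '{"type": "INJECT", "confId": "default", "args": {"seedDir": "/path/to/seeds"}}'
```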

P.S. - There can be many solutions to this. I am just giving one.  :)

Regards,
Karanjeet Singh
http://irds.usc.edu

On Thu, Sep 29, 2016 at 1:33 AM, Sachin Shaju <sachi...@mstack.com> wrote:

> Hi,
>    I was experimenting with some crawl cycles in Nutch and would like to
> set up a distributed crawl environment. But I wonder how I can trigger
> Nutch for incoming crawl requests in a production system. I read about the
> Nutch REST API. Is that the only real option I have? Or can I run Nutch as
> a continuously running distributed server by any other means?
>
>      My preferred Nutch version is 1.12.
>
> Regards,
> Sachin Shaju
>
> sachi...@mstack.com
> +919539887554
>

