RE: Nutch and SOLR integration

Markus Jelsma Wed, 05 Oct 2016 11:45:19 -0700

Hello - see inline.
Markus
 
-----Original message-----
> From:WebDawg <[email protected]>
> Sent: Wednesday 5th October 2016 15:51
> To: [email protected]
> Subject: Nutch and SOLR integration
> 
> I am new to Solr and Nutch.
> 
> I was working through the tutorials and managed to get everything
> going up to making nutch work but not throwing the results at Solr
> yet.
> 
> From everything that I have read you throw data at SOLR or SOLR cloud
> and it just does its thing.  I have yet to get into the details w/
> SOLR yet but it seems that it is not meant to be a full solution
> stack...IE Nutch and SOLR together = accessible secure search engine?


No, Solr is a backend application. Never expose it to the web unless you 
really know what you are doing. At a minimum you need a proxy of some kind (we 
use Nginx), and of course a web application that processes requests and 
delivers results, e.g. a JavaScript or server-side application.
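
To illustrate the "web application that processes requests" part, here is a 
minimal sketch in Python of the request-handling side. It is not what we run. 
It only shows the idea that the frontend builds the Solr request itself from a 
small set of fixed parameters, rather than passing the visitor's request 
through. The host, the core name (example_core), MAX_ROWS and the function name 
are assumptions made up for the example; q, start, rows and wt are ordinary 
Solr select parameters.

```python
from urllib.parse import urlencode

# Hypothetical internal address; Solr itself stays unreachable from the web.
SOLR_SELECT_URL = "http://localhost:8983/solr/example_core/select"
MAX_ROWS = 50  # assumed upper bound on results per page


def build_solr_query(user_query: str, page: int = 0, rows: int = 10) -> str:
    """Build a Solr select URL from untrusted input using fixed parameters."""
    query = user_query.strip()
    if not query:
        raise ValueError("empty query")
    rows = max(1, min(rows, MAX_ROWS))  # clamp the page size
    page = max(0, page)                 # no negative offsets
    params = {
        "q": query,
        "start": page * rows,
        "rows": rows,
        "wt": "json",
    }
    # urlencode escapes the user input; nothing else from the request is used.
    return SOLR_SELECT_URL + "?" + urlencode(params)
```

The application would fetch this URL server side and render the results, so 
the visitor never talks to Solr directly.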

> 
> For instance with SOLR it looks like I am supposed to use/build a
> front end for it?

Yes. Solr has support for Velocity, which can generate search result pages 
from within Solr. But, again, you need to know what you are doing in this case.

> 
> I always assumed that these were back end components.  But that is the
> problem, I have assumed.

Yes, they are very much backend components. You need the things I mentioned 
above, but also a lot of glue, such as provisioning, data management tools, 
log reporting programs and, if you want to be fancy, analytics. In any case, 
without deep knowledge of Solr and Nutch, and everything around them, search 
engines usually turn out kind of bad.

> 
> Is nutch supposed to be a solution that I should script against?  I
> read guides that show how to setup and then people just say use a
> 'crontab'.

Yes, you can use a crontab (with locking) to initiate a crawl cycle every 
minute. We use elaborate scripts to manage a cycle, because we also automated 
the configuration files that come from backend applications and management 
tools. But we intend to migrate to Apache Oozie and run our crawl cycle over 
there, because Oozie is robust and supports failover really well. Like most 
Hadoop components, though, it is cumbersome to set up.

But you don't really need this if you crawl only a single site or a few fixed 
sites.
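
For the simple crontab case, the locking can be sketched as below. This is a 
minimal Python example, not our actual scripts. It assumes a Unix system (it 
uses fcntl.flock), and the lock path and the crawl command are hypothetical 
placeholders for whatever starts one crawl cycle on your machine. The point is 
that a run started by cron skips itself when the previous cycle still holds 
the lock.

```python
import fcntl
import subprocess

LOCK_PATH = "/tmp/nutch-crawl.lock"            # hypothetical lock file
CRAWL_COMMAND = ["/path/to/run-one-cycle.sh"]  # hypothetical placeholder


def run_with_lock(command, lock_path=LOCK_PATH):
    """Run command unless another run holds the lock.

    Returns the command's exit code, or None if the run was skipped.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None  # previous cycle is still running, skip this one
        # The lock is released when lock_file is closed at the end of the block.
        return subprocess.run(command).returncode
```

A script called from the crontab would then just call 
run_with_lock(CRAWL_COMMAND) and exit.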

> 
> Reading that everyone uses the power of these projects to create
> amazing things...
> 
> What is available to manage these products?  I get that SOLR indexes
> and Nutch spiders but is there something that controls Nutch in a
> smart manner or am I supposed to do this on my own via programming?

No, to my knowledge there is no such thing available. Such scripts or control 
programs are usually tied into a company's backend systems, as ours are.

> 
> Is there anything out there that finely controls nutch?  Is there
> anything out there that configures nutch, or multiple nutch
> instances/profiles?

No, not to my knowledge. See the answer above.

>
