Re: Recursively searching through web dirs

lewis john mcgibbney Wed, 24 Aug 2011 13:59:05 -0700

Hi Adam,

My initial thought is that you are correct. It is very unusual for files to
be located at a URL in the same domain that is not referenced by the
top-level URL or by any subsequent-level URL within the domain.


What I would suggest is that you have a look through your hadoop.log and
use some of the commands that let you investigate your crawldb, segment(s)
and linkdb, if you've created one.

Have a look at the wiki under command line options.
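
As a rough sketch of the kind of checks I mean, something like the following
should work with the Nutch 1.x command line. The crawl directory name
(crawl) and the NUTCH and CRAWL_DIR variables are assumptions for
illustration, not taken from your setup; if you ran the crawl command
without -dir, the output directory will have a different name, so adjust
accordingly.

```shell
# Assumed locations; adjust both to your installation.
NUTCH="${NUTCH:-./nutch}"        # launcher script, as in the original command
CRAWL_DIR="${CRAWL_DIR:-crawl}"  # hypothetical crawl output directory

inspect_crawl() {
  # Summary of the crawldb: number of URLs per status (fetched, unfetched, ...)
  "$NUTCH" readdb "$CRAWL_DIR/crawldb" -stats

  # One line per segment with its generated/fetched/parsed counts
  "$NUTCH" readseg -list -dir "$CRAWL_DIR/segments"

  # Dump the inlinks as text, but only if a linkdb was actually created
  if [ -d "$CRAWL_DIR/linkdb" ]; then
    "$NUTCH" readlinkdb "$CRAWL_DIR/linkdb" -dump "$CRAWL_DIR/linkdb_dump"
  fi
}
```

Calling inspect_crawl from the directory that holds the nutch script prints
the crawldb statistics and the segment list. If no URL under /kml/temp/
shows up in the crawldb, it was never discovered, which would fit the
missing hyperlink.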

On Wed, Aug 24, 2011 at 9:03 PM, Adam Estrada
<estrada.adam.gro...@gmail.com> wrote:

> All,
>
> I have a root domain and a couple directories deep I have some files that I
> want to index. The problem is that they are not referenced on the main page
> using a hyperlink or anything like that.
>
> http://www.geoglobaldomination.org/kml/temp/
>
> I want to be able to crawl down in to /kml/temp/ without knowing that it's
> even there. Is there a way to do this in Nutch?
>
> echo http://www.geoglobaldomination.org > urls
>
> ./nutch crawl urls -threads 10 -depth 10 -topN 20 -solr http://172.16.2.107:8983/solr
>
> Nothing and I suspect that it's because there is not a hyperlink on the
> main page.
>
> Thoughts?
> Adam
>



-- 
*Lewis*
