I have tried all kinds of combinations in the crawl-urlfilter file, and none
of them work.
What I want to do is simple. I have a JSP-based site running on my local
machine, and I want to crawl its non-secure pages and get them indexed.
From the documentation, I gathered that I need to create a urls directory
and place a file there containing the site URL. Then I need to change the
crawl-urlfilter file and add the domain I want to include.
So here is the frontend file that I placed in the urls directory:
http://172.16.10.99:7001/frontend/
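(I created the directory and the file under Cygwin with roughly the following
commands; as far as I can tell the file name itself does not matter:)

  mkdir urls
  echo "http://172.16.10.99:7001/frontend/" > urls/frontend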
In the crawl-urlfilter.txt file I changed only the line for the domain name
as follows:
+^http://172.16.10.99:7001/frontend/
I tried every variation of the above that I could think of, including using
only http://172.16.10.99:7001/frontend/ on its own, but none of them work.
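For context, the stock crawl-urlfilter.txt that ships with Nutch 0.8.1 looks
roughly like this around the line I changed (quoted from memory, so treat it
as approximate); the first matching pattern wins:

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

  # accept hosts in MY.DOMAIN.NAME
  +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

  # skip everything else
  -.

I replaced the MY.DOMAIN.NAME pattern with my URL and left the rest of the
file untouched.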
Here is the exception I am getting. The crawl does create an output directory,
but searching against it turns up nothing.
fetching http://172.16.10.99:7001/frontend/
fetch of http://172.16.10.99:7001/frontend/ failed with:
java.lang.NullPointerException
Any ideas?
Thanks
Anand
-----Original Message-----
From: Dima Mazmanov [mailto:[EMAIL PROTECTED]]
Sent: Friday, September 29, 2006 11:37 AM
To: Narayan, Anand
Subject: Re: Crawl on local site not working
Hi, Anand.
You wrote on 29 September 2006 at 18:22:39:
> I am new to Nutch and am trying to see if we can use it for web search
> functionality.
> I am running the site on my local box on a WebLogic server. I am
> using Nutch 0.8.1 on Windows XP under Cygwin.
> I created a "urls" directory and then created a file called "frontend"
> in that directory. The local URL that I have specified in that file is
> http://172.16.10.99:7001/frontend/
> This is the only line in that file.
> I have also changed the crawl-urlfilter file as follows:
> # accept hosts in MY.DOMAIN.NAME
> +^http://172.16.10.99:7001/frontend/
This is bad.
Remove that line from the file, and instead copy the URL from your "frontend"
file into the crawl-urlfilter file directly after the line
# accept hosts in MY.DOMAIN.NAME.
Then remove the +. at the end of the file and write -. instead.
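In other words, the end of your crawl-urlfilter file should look roughly like
this (my sketch; keep the '+' prefix on your pattern, since every filter rule
needs a leading + or -):

  # accept hosts in MY.DOMAIN.NAME
  +^http://172.16.10.99:7001/frontend/

  # skip everything else
  -.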
> The command I am executing is
> bin/nutch crawl urls -dir _crawloutput -depth 3 -topN 50
> The crawl output I get is as follows:
> crawl started in: _crawloutput
> rootUrlDir = urls
> threads = 10
> depth = 3
> topN = 50
> Injector: starting
> Injector: crawlDb: _crawloutput/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: starting
> Generator: segment: _crawloutput/segments/20060929101916
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: _crawloutput/segments/20060929101916
> Fetcher: threads: 10
> fetching http://172.16.10.99:7001/frontend/
> fetch of http://172.16.10.99:7001/frontend/ failed with:
> java.lang.NullPointerException
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: _crawloutput/crawldb
> CrawlDb update: segment: _crawloutput/segments/20060929101916
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: starting
> Generator: segment: _crawloutput/segments/20060929101924
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: _crawloutput/segments/20060929101924
> Fetcher: threads: 10
> fetching http://172.16.10.99:7001/frontend/
> fetch of http://172.16.10.99:7001/frontend/ failed with:
> java.lang.NullPointerException
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: _crawloutput/crawldb
> CrawlDb update: segment: _crawloutput/segments/20060929101924
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: starting
> Generator: segment: _crawloutput/segments/20060929101932
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: _crawloutput/segments/20060929101932
> Fetcher: threads: 10
> fetching http://172.16.10.99:7001/frontend/
> fetch of http://172.16.10.99:7001/frontend/ failed with:
> java.lang.NullPointerException
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: _crawloutput/crawldb
> CrawlDb update: segment: _crawloutput/segments/20060929101932
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: _crawloutput/linkdb
> LinkDb: adding segment: _crawloutput/segments/20060929101916
> LinkDb: adding segment: _crawloutput/segments/20060929101924
> LinkDb: adding segment: _crawloutput/segments/20060929101932
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: _crawloutput/linkdb
> Indexer: adding segment: _crawloutput/segments/20060929101916
> Indexer: adding segment: _crawloutput/segments/20060929101924
> Indexer: adding segment: _crawloutput/segments/20060929101932
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: _crawloutput/indexes
> Dedup: done
> Adding _crawloutput/indexes/part-00000
> crawl finished: _crawloutput
> I am not sure what I am doing wrong. Can someone help?
> Thanks
> Anand Narayan
--
Regards,
Dima mailto:[EMAIL PROTECTED]