Did you set the agent name in the Nutch configuration? I think the agent name still needs to be set even when crawling only the local file system. If it is not set, I believe nothing is fetched and errors are thrown, but you would only see them if your logging was set up for it.
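
For reference, the agent name is normally set through the http.agent.name property in conf/nutch-site.xml. A minimal sketch, assuming the property goes inside the existing <configuration> element; the value shown is only a placeholder, not something from the original post:

```xml
<!-- Sketch only: "MyLocalCrawler" is a hypothetical placeholder value. -->
<property>
  <name>http.agent.name</name>
  <value>MyLocalCrawler</value>
</property>
```

Any non-empty value should do for a local crawl, as far as I know.
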
Dennis Kubes

jim shirreffs wrote:
> I googled and googled and goolged I am trying to crawl my local file
> system and can't seem to get it right.
>
> I use this command
>
> bin/mutch crawl urls -dir crawl
>
> My urls dir contains one file (files) that looks like this
>
> file:///c:/joms
>
> c:/joms exists
>
> I've modified the config file crawl-urlfilter.txt
>
> #-^(file|ftp|mailto|sw|swf):
> -^(http|ftp|mailto|sw|swf):
>
> # skip everything else ..... web spaces
> #-.
> +.*
>
> And the config file nutch-site.xml adding
>
> <property>
> <name>plugin.includes</name>
> <value>protocol-file|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
> </property>
> <property>
> <name>file.content.limit</name>
> <value>-1</value>
> </property>
> </configuration>
>
> And lastly I've modified regex-urlfilter.txt
> #file systems
> +^file:///c:/top/directory/
> -.
>
> # skip file: ftp: and mailto: urls
> #-^(file|ftp|mailto):
> -^(http|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break
> loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept anything else
> +.
>
> I don't get any errors but nothing gets crawled either. If anyone can
> point out my mistake(s) I would greatly appreciate it.
>
> thanks in advance
>
> jim s
>
> ps it would also be nice to know this email is getting into the
> nutch-users mailing list

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
