ah yes. i am wrong, do not remove it :)

-----Original message-----
From: Bayu Widyasanyata <bwidyasany...@gmail.com>
Sent: Wed 04-06-2014 15:25
Subject: Re: Crawling web and intranet files into single crawldb
To: user@nutch.apache.org

Hi Markus,
Did you mean I should remove the "file://" line from prefix-urlfilter.txt?

When I checked with the command:

bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls/seed.txt

it returned:

Checking combination of all URLFilters available
-http://www.myurl.com
-file://opt/searchengine/nutch

What does this mean?

The following are the contents of my prefix-urlfilter.txt file (default configuration):

http://
https://
ftp://
file://

Whether or not I remove "file://", the result of nutch URLFilterChecker stays the same.

On Wed, Jun 4, 2014 at 7:50 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Remove it from the prefix filter and confirm it works using bin/nutch
> org.apache.nutch.net.URLFilterChecker -allCombined
>
>
> -----Original message-----
> From: Bayu Widyasanyata <bwidyasany...@gmail.com>
> Sent: Wed 04-06-2014 14:47
> Subject: Re: Crawling web and intranet files into single crawldb
> To: user@nutch.apache.org
>
> Hi Markus,
>
> These are the files I should configure:
>
> = prefix-urlfilter.txt: keep file://, which is already configured.
> = regex-urlfilter.txt: update the line -^(file|ftp|mailto) to
>   -^(ftp|mailto):
> = urls/seed.txt: add the new URL/file path.
>
> ...and start crawling.
>
> Is that enough? CMIIW
>
> Thanks-
>
>
> On Wed, Jun 4, 2014 at 7:33 PM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
>
> > Hi Bayu,
> >
> > You must enable the protocol-file plugin first. Then make sure the file://
> > prefix is not filtered via prefix-urlfilter.txt or any other filter. Then
> > just inject the new URLs and start the crawl.
> >
> > Cheers
> >
> >
> > -----Original message-----
> > From: Bayu Widyasanyata <bwidyasany...@gmail.com>
> > Sent: Wed 04-06-2014 14:30
> > Subject: Crawling web and intranet files into single crawldb
> > To: user@nutch.apache.org
> >
> > Hi,
> >
> > I have successfully run Nutch 1.8 and Solr 4.8.1 to fetch and index web
> > sources (http protocol).
> > And now I want to add file share data sources (file protocol) to the
> > current crawldb.
> >
> > What is the strategy or common practice for handling this situation?
> >
> > Thank you.-
> >
> > --
> > wassalam,
> > [bayu]
>
>
> --
> wassalam,
> [bayu]

--
wassalam,
[bayu]
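[Editor's note] The filtering semantics discussed in this thread can be sketched in Python. This is a simplified model of the documented behaviour, not Nutch's actual Java implementation (the real classes are PrefixURLFilter and RegexURLFilter): a URL must start with a prefix listed in prefix-urlfilter.txt, and regex-urlfilter.txt rules are applied in order, where a leading "-" rejects and "+" accepts. It illustrates why file:// URLs are rejected while the default -^(file|ftp|mailto) rule is still present:

```python
import re

# Simplified model of Nutch's combined URL filtering (assumption: this
# mimics the documented semantics only, not the real implementation).

# prefix-urlfilter.txt: a URL passes only if it starts with a listed prefix.
PREFIXES = ["http://", "https://", "ftp://", "file://"]

# regex-urlfilter.txt: rules applied in order; "-" rejects, "+" accepts.
# The default -^(file|ftp|mailto) rule is still in place here.
REGEX_RULES = [
    ("-", re.compile(r"^(file|ftp|mailto)")),
    ("+", re.compile(r".")),  # default catch-all accept rule
]

def check(url: str) -> str:
    """Return '+' if the URL survives all filters, '-' otherwise,
    matching the output format of URLFilterChecker -allCombined."""
    if not any(url.startswith(p) for p in PREFIXES):
        return "-"
    for sign, pattern in REGEX_RULES:
        if pattern.search(url):
            return sign
    return "-"

print(check("http://www.myurl.com/"))           # accepted: "+"
print(check("file:///opt/searchengine/nutch"))  # rejected by the regex rule: "-"
```

Under this model, removing the prefix entry alone is not enough: the regex rule still rejects file:// URLs until it is changed to -^(ftp|mailto). It does not explain why Bayu's checker also rejected the http:// URL; in a real install some other active filter plugin would have to be responsible for that.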