OK, thanks! :)
On Wed, Jun 4, 2014 at 8:28 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Ah yes, I am wrong, do not remove it :)
>
>
> -----Original message-----
> From: Bayu Widyasanyata <bwidyasany...@gmail.com>
> Sent: Wed 04-06-2014 15:25
> Subject: Re: Crawling web and intranet files into single crawldb
> To: user@nutch.apache.org
>
> Hi Markus,
>
> Did you mean I should remove the "file://" line from prefix-urlfilter.txt?
>
> When I checked with the command: bin/nutch
> org.apache.nutch.net.URLFilterChecker -allCombined < urls/seed.txt, it
> returns:
>
> Checking combination of all URLFilters available
> -http://www.myurl.com
> -file://opt/searchengine/nutch
>
> What does it mean?
>
> The following are the contents of my prefix-urlfilter.txt file (default
> configuration):
>
> http://
> https://
> ftp://
> file://
>
> Whether I remove the "file://" line or not, the result of the nutch
> URLFilterChecker stays the same.
>
> On Wed, Jun 4, 2014 at 7:50 PM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
>
> > Remove it from the prefix filter and confirm it works using bin/nutch
> > org.apache.nutch.net.URLFilterChecker -allCombined
> >
> >
> > -----Original message-----
> > From: Bayu Widyasanyata <bwidyasany...@gmail.com>
> > Sent: Wed 04-06-2014 14:47
> > Subject: Re: Crawling web and intranet files into single crawldb
> > To: user@nutch.apache.org
> >
> > Hi Markus,
> >
> > These are the files I should configure:
> >
> > = prefix-urlfilter.txt: keep the file:// line, which is already configured.
> > = regex-urlfilter.txt: update the line -^(file|ftp|mailto): to
> > -^(ftp|mailto):
> > = urls/seed.txt: add the new URL/file path.
> >
> > ...and start crawling.
> >
> > Is that enough? CMIIW
> >
> > Thanks-
> >
> >
> > On Wed, Jun 4, 2014 at 7:33 PM, Markus Jelsma <markus.jel...@openindex.io>
> > wrote:
> >
> > > Hi Bayu,
> > >
> > > You must enable the protocol-file plugin first. Then make sure the
> > > file:// prefix is not filtered via prefix-urlfilter.txt or any other
> > > filter. Now just inject the new URLs and start the crawl.
> > >
> > > Cheers
> > >
> > >
> > > -----Original message-----
> > > From: Bayu Widyasanyata <bwidyasany...@gmail.com>
> > > Sent: Wed 04-06-2014 14:30
> > > Subject: Crawling web and intranet files into single crawldb
> > > To: user@nutch.apache.org
> > >
> > > Hi,
> > >
> > > I am successfully running Nutch 1.8 and Solr 4.8.1 to fetch and index
> > > web sources (http protocol). Now I want to add file share data sources
> > > (file protocol) to the current crawldb.
> > >
> > > What is the strategy or common practice to handle this situation?
> > >
> > > Thank you.-
> > >
> > > --
> > > wassalam,
> > > [bayu]
> >
> > --
> > wassalam,
> > [bayu]
>
> --
> wassalam,
> [bayu]

--
wassalam,
[bayu]
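
[Editor's note: for readers landing on this thread, a few sketches of the steps discussed above. None of this is verbatim from the posters' setups. "Enable protocol-file" means adding it to the plugin.includes property, usually by overriding it in conf/nutch-site.xml. The plugin list below only illustrates the Nutch 1.8-era defaults; copy the actual value from your own nutch-default.xml and insert protocol-file into it.]

    <!-- conf/nutch-site.xml: override plugin.includes to add protocol-file.
         The rest of the list is illustrative; start from the value in your
         nutch-default.xml. -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>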
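[Editor's note: the filter edits Bayu lists would look like the sketch below. On the unanswered "What does it mean?": in URLFilterChecker output a leading "-" means the URL was rejected by the combined filter chain and "+" means it was accepted, so both seeds in the run above are still being filtered out. Note also that a local path is conventionally written with an empty host, i.e. file:///opt/searchengine/nutch; the two-slash form file://opt/... parses "opt" as a hostname, which can trip filters and the protocol plugin.]

    # conf/regex-urlfilter.txt -- stop skipping file: URLs
    # (the default skip rule is: -^(file|ftp|mailto):)
    -^(ftp|mailto):

    # conf/prefix-urlfilter.txt -- leave the file:// prefix in place
    http://
    https://
    ftp://
    file://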
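[Editor's note: finally, a sketch of the "inject and crawl" step under assumed paths: a crawl/crawldb directory, a urls/ seed directory, and a local Solr core, all of which you should adjust to your own installation. Re-running the checker first confirms the filters now accept the file: seed.]

    # verify the filters: a '+' prefix in the output means the URL is accepted
    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls/seed.txt

    # inject the new file: seeds into the existing crawldb
    bin/nutch inject crawl/crawldb urls

    # then run the usual rounds, e.g. with the crawl script bundled in 1.8:
    # crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
    bin/crawl urls crawl http://localhost:8983/solr/collection1 2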