OK, thanks! :)
On Wed, Jun 4, 2014 at 8:28 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Ah yes, I am wrong, do not remove it :)
>
>
> -----Original message-----
> From: Bayu Widyasanyata <bwidyasany...@gmail.com>
> Sent: Wed 04-06-2014 15:25
> Subject: Re: Crawling web and intranet files into single crawldb
> To: user@nutch.apache.org
>
> Hi Markus,
>
> Did you mean I should remove the "file://" line from prefix-urlfilter.txt?
>
> When I checked with the command: bin/nutch
> org.apache.nutch.net.URLFilterChecker -allCombined < urls/seed.txt, it
> returns:
>
> Checking combination of all URLFilters available
> -http://www.myurl.com
> -file://opt/searchengine/nutch
>
> What does it mean?
>
> The following are the contents of my prefix-urlfilter.txt file (default
> configuration):
>
> http://
> https://
> ftp://
> file://
>
> Whether I remove the "file://" line or not, the result of the nutch
> URLFilterChecker stays the same.
>
> On Wed, Jun 4, 2014 at 7:50 PM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
>
> > Remove it from the prefix filter and confirm it works using bin/nutch
> > org.apache.nutch.net.URLFilterChecker -allCombined
> >
> >
> > -----Original message-----
> > From: Bayu Widyasanyata <bwidyasany...@gmail.com>
> > Sent: Wed 04-06-2014 14:47
> > Subject: Re: Crawling web and intranet files into single crawldb
> > To: user@nutch.apache.org
> >
> > Hi Markus,
> >
> > These are the files I should configure:
> >
> > = prefix-urlfilter.txt: keep the file:// line, which is already configured.
> > = regex-urlfilter.txt: update the line -^(file|ftp|mailto): to
> > -^(ftp|mailto):
> > = urls/seed.txt: add the new URL/file path.
> >
> > ...and start crawling.
> >
> > Is that enough? CMIIW
> >
> > Thanks-
> >
> >
> > On Wed, Jun 4, 2014 at 7:33 PM, Markus Jelsma <markus.jel...@openindex.io>
> > wrote:
> >
> > > Hi Bayu,
> > >
> > > You must enable the protocol-file plugin first. Then make sure the
> > > file:// prefix is not filtered via prefix-urlfilter.txt or any other
> > > filter. Now just inject the new URLs and start the crawl.
> > >
> > > Cheers
> > >
> > >
> > > -----Original message-----
> > > From: Bayu Widyasanyata <bwidyasany...@gmail.com>
> > > Sent: Wed 04-06-2014 14:30
> > > Subject: Crawling web and intranet files into single crawldb
> > > To: user@nutch.apache.org
> > >
> > > Hi,
> > >
> > > I am successfully running Nutch 1.8 and Solr 4.8.1 to fetch and index
> > > web sources (http protocol). Now I want to add file share data sources
> > > (file protocol) to the current crawldb.
> > >
> > > What is the strategy or common practice to handle this situation?
> > >
> > > Thank you.-
> > >
> > > --
> > > wassalam,
> > > [bayu]
> >
> > --
> > wassalam,
> > [bayu]
>
> --
> wassalam,
> [bayu]

--
wassalam,
[bayu]
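
[Editor's note: for readers landing on this thread, a few sketches of the steps discussed above. None of this is verbatim from the posters' setups. "Enable protocol-file" means adding it to the plugin.includes property, usually by overriding it in conf/nutch-site.xml. The plugin list below only illustrates the Nutch 1.8-era defaults; copy the actual value from your own nutch-default.xml and insert protocol-file into it.]

    <!-- conf/nutch-site.xml: override plugin.includes to add protocol-file.
         The rest of the list is illustrative; start from the value in your
         nutch-default.xml. -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>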
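[Editor's note: the filter edits Bayu lists would look like the sketch below. On the unanswered "What does it mean?": in URLFilterChecker output a leading "-" means the URL was rejected by the combined filter chain and "+" means it was accepted, so both seeds in the run above are still being filtered out. Note also that a local path is conventionally written with an empty host, i.e. file:///opt/searchengine/nutch; the two-slash form file://opt/... parses "opt" as a hostname, which can trip filters and the protocol plugin.]

    # conf/regex-urlfilter.txt -- stop skipping file: URLs
    # (the default skip rule is: -^(file|ftp|mailto):)
    -^(ftp|mailto):

    # conf/prefix-urlfilter.txt -- leave the file:// prefix in place
    http://
    https://
    ftp://
    file://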
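[Editor's note: finally, a sketch of the "inject and crawl" step under assumed paths: a crawl/crawldb directory, a urls/ seed directory, and a local Solr core, all of which you should adjust to your own installation. Re-running the checker first confirms the filters now accept the file: seed.]

    # verify the filters: a '+' prefix in the output means the URL is accepted
    bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls/seed.txt

    # inject the new file: seeds into the existing crawldb
    bin/nutch inject crawl/crawldb urls

    # then run the usual rounds, e.g. with the crawl script bundled in 1.8:
    # crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
    bin/crawl urls crawl http://localhost:8983/solr/collection1 2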