ah yes. i am wrong, do not remove it :)

-----Original message-----
From: Bayu Widyasanyata <bwidyasany...@gmail.com>
Sent: Wed 04-06-2014 15:25
Subject: Re: Crawling web and intranet files into single crawldb
To: user@nutch.apache.org

Hi Markus,
Did you mean I should remove the "file://" line from prefix-urlfilter.txt?

When I checked with the command:

bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined < urls/seed.txt

it returned:

Checking combination of all URLFilters available
-http://www.myurl.com
-file://opt/searchengine/nutch

What does this mean?

The following are the contents of my prefix-urlfilter.txt file (default configuration):

http://
https://
ftp://
file://

Whether or not I remove "file://", the result of nutch URLFilterChecker stays the same.

On Wed, Jun 4, 2014 at 7:50 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> Remove it from the prefix filter and confirm it works using bin/nutch
> org.apache.nutch.net.URLFilterChecker -allCombined
>
>
> -----Original message-----
> From: Bayu Widyasanyata <bwidyasany...@gmail.com>
> Sent: Wed 04-06-2014 14:47
> Subject: Re: Crawling web and intranet files into single crawldb
> To: user@nutch.apache.org
>
> Hi Markus,
>
> These are the files I should configure:
>
> = prefix-urlfilter.txt: keep file://, which is already configured.
> = regex-urlfilter.txt: update the line -^(file|ftp|mailto) to
>   -^(ftp|mailto):
> = urls/seed.txt: add the new URL/file path.
>
> ...and start crawling.
>
> Is that enough? CMIIW
>
> Thanks-
>
>
> On Wed, Jun 4, 2014 at 7:33 PM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
>
> > Hi Bayu,
> >
> > You must enable the protocol-file plugin first. Then make sure the file://
> > prefix is not filtered via prefix-urlfilter.txt or any other filter. Then
> > just inject the new URLs and start the crawl.
> >
> > Cheers
> >
> >
> > -----Original message-----
> > From: Bayu Widyasanyata <bwidyasany...@gmail.com>
> > Sent: Wed 04-06-2014 14:30
> > Subject: Crawling web and intranet files into single crawldb
> > To: user@nutch.apache.org
> >
> > Hi,
> >
> > I have successfully run Nutch 1.8 and Solr 4.8.1 to fetch and index web
> > sources (http protocol).
> > And now I want to add file share data sources (file protocol) to the
> > current crawldb.
> >
> > What is the strategy or common practice for handling this situation?
> >
> > Thank you.-
> >
> > --
> > wassalam,
> > [bayu]
>
>
> --
> wassalam,
> [bayu]

--
wassalam,
[bayu]
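[Editor's note] The filtering semantics discussed in this thread can be sketched in Python. This is a simplified model of the documented behaviour, not Nutch's actual Java implementation (the real classes are PrefixURLFilter and RegexURLFilter): a URL must start with a prefix listed in prefix-urlfilter.txt, and regex-urlfilter.txt rules are applied in order, where a leading "-" rejects and "+" accepts. It illustrates why file:// URLs are rejected while the default -^(file|ftp|mailto) rule is still present:

```python
import re

# Simplified model of Nutch's combined URL filtering (assumption: this
# mimics the documented semantics only, not the real implementation).

# prefix-urlfilter.txt: a URL passes only if it starts with a listed prefix.
PREFIXES = ["http://", "https://", "ftp://", "file://"]

# regex-urlfilter.txt: rules applied in order; "-" rejects, "+" accepts.
# The default -^(file|ftp|mailto) rule is still in place here.
REGEX_RULES = [
    ("-", re.compile(r"^(file|ftp|mailto)")),
    ("+", re.compile(r".")),  # default catch-all accept rule
]

def check(url: str) -> str:
    """Return '+' if the URL survives all filters, '-' otherwise,
    matching the output format of URLFilterChecker -allCombined."""
    if not any(url.startswith(p) for p in PREFIXES):
        return "-"
    for sign, pattern in REGEX_RULES:
        if pattern.search(url):
            return sign
    return "-"

print(check("http://www.myurl.com/"))           # accepted: "+"
print(check("file:///opt/searchengine/nutch"))  # rejected by the regex rule: "-"
```

Under this model, removing the prefix entry alone is not enough: the regex rule still rejects file:// URLs until it is changed to -^(ftp|mailto). It does not explain why Bayu's checker also rejected the http:// URL; in a real install some other active filter plugin would have to be responsible for that.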