-Original message-
From:Matthias Paul magethle.nu...@gmail.com
Sent: Wed 06-Jun-2012 09:47
To: user@nutch.apache.org
Subject: Linkdb empty
Hi all,
I noticed that my linkdb is always empty although I use the generated
segments from the last crawl for the generation of the
-Original message-
From:chethan chethan.p...@gmail.com
Sent: Wed 06-Jun-2012 05:12
To: user@nutch.apache.org
Subject: Nutch topN selection
Hi,
Does the topN threshold consider page score for the selection? If it's set
to, say, 10, does Nutch queue up the 10 top-scoring URLs
-Original message-
From:Andy Xue andyxuey...@gmail.com
Sent: Wed 06-Jun-2012 05:04
To: user@nutch.apache.org
Subject: Behaviour of "urlfilter-suffix" plug-in when dealing with
a URL without filename extension
Hi all:
Does the urlfilter-suffix plug-in prune URL
-Original message-
From:pepe3059 pepe3...@gmail.com
Sent: Wed 06-Jun-2012 02:58
To: user@nutch.apache.org
Subject: RE: threads disminution when fetching page
me again :)
at the end of fetch process, is the regex-urlfilter considered?
No. At the end of the fetch the mapper
Thanks
-Chethan
On Wed, Jun 6, 2012 at 1:34 PM, Markus Jelsma markus.jel...@openindex.io wrote:
-Original message-
From:chethan chethan.p...@gmail.com
Sent: Wed 06-Jun-2012 05:12
To: user@nutch.apache.org
Subject: Nutch topN selection
Hi,
Does the topN threshold
Hi Markus:
Thanks for the reply and information provided. I did a quick test by:
1. adding urlfilter-suffix in plugin.includes property in
nutch-site.xml
2. running runtime/local/bin/nutch org.apache.nutch.net.URLFilterChecker
-filterName org.apache.nutch.urlfilter.suffix.SuffixURLFilter
Here
-Original message-
From:Andy Xue andyxuey...@gmail.com
Sent: Wed 06-Jun-2012 11:11
To: Markus Jelsma markus.jel...@openindex.io; user@nutch.apache.org
Subject: Re: Behaviour of "urlfilter-suffix" plug-in when dealing
with a URL without filename extension
Hi Markus:
Both db.ignore.internal.links and db.ignore.external.links are true in my case.
Since I crawl only one domain, I suppose setting
db.ignore.external.links to true is a good idea.
So db.ignore.internal.links should be false?
From what I understand db.ignore.external.links is a setting for the
What's the problem with having the seed page? Can you not inject only the /news
pages? In any case, you can always filter it out later, after the first fetch cycle.
-Original message-
From:Shameema Umer shem...@gmail.com
Sent: Wed 06-Jun-2012 13:02
To: user@nutch.apache.org
Subject:
I really need to fetch news from a set of domains.
But most of my domains have news links like this:
www.mydomain.com/article/ http://www.mydomain.com/news/
werwer-wefewf-wfefef-fregd/
and the page www.mydomain.com/article/ http://www.mydomain.com/news/ does
not exist. So I am forced to give site
I have successfully set up Nutch as the crawler for a Solr index on the
http://szukaj.ug.edu.pl site. But there is a problem with the HTTP Referer
header: Nutch does not send it when crawling sites.
Is it possible to make Nutch send the Referer header with its requests?
Hi
Nutch cannot do this by default, and it is tricky to implement because there may
not be one unique referrer per page. What you can try is to add the referrer to
outlinks when parsing records. This outlink can be added to CrawlDatum's
MetaData which you can then later use to set the referrer. To set
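As a rough illustration of the idea above (carry the referrer along with each outlink, then use it when the request is built), here is a minimal Python sketch. It is not Nutch code and does not use the Nutch API: the `metadata` dict merely stands in for the per-URL metadata, and the `referrer` key, the `build_request` helper and the example.org URLs are all assumptions made for this example.

```python
import urllib.request


def build_request(url, metadata):
    """Build a request for `url`, adding a Referer header when the
    (hypothetical) per-URL metadata carries a 'referrer' entry."""
    headers = {}
    referrer = metadata.get("referrer")
    if referrer:
        # The HTTP header name is spelled "Referer" (one "r").
        headers["Referer"] = referrer
    return urllib.request.Request(url, headers=headers)


# Hypothetical outlink found on http://example.org/index.html.
outlink_metadata = {"referrer": "http://example.org/index.html"}
request = build_request("http://example.org/news/", outlink_metadata)

# A seed URL has no referring page, so no header is added.
seed_request = build_request("http://example.org/", {})
```

Because a page can be linked from many places, a real implementation would have to decide which referrer to keep; this sketch simply uses whichever one was stored.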
Hi all.
I need to configure Nutch to crawl for a specific length of time, for example
5 minutes or 1 hour. Does anybody know how to limit the crawl process to a
fixed duration?
The first problem that comes to mind: is your regex-urlfilter accepting those URLs?
You should also run readseg on the crawled segment to see if those URLs are
listed in the outlinks of the feeds.
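To check by hand whether a set of rules would accept a URL, the first-match logic of a regex URL filter can be sketched as below. This is a simplified Python stand-in, not the Nutch plugin itself: it assumes rules written one per line as `+regex` (accept) or `-regex` (reject), with the first matching rule deciding and unmatched URLs rejected. The rules and the www.mydomain.com URLs are made up for the example, and Java and Python regex syntax differ in some details.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule that matches anywhere in `url`;
    reject the URL when no rule matches."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Hypothetical rules: drop a few static-file suffixes, keep the /news section.
EXAMPLE_RULES = r"""
# skip images and stylesheets
-\.(gif|jpg|png|css|js)$
# accept the news section of the example host
+^https?://www\.mydomain\.com/news/
"""

rules = load_rules(EXAMPLE_RULES)
for url in ("http://www.mydomain.com/news/werwer-wefewf-wfefef-fregd/",
            "http://www.mydomain.com/about"):
    print(url, "accepted" if accepts(rules, url) else "rejected")
```

Rule order matters here: a reject rule placed before the accept rule wins for any URL that matches both.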
Regards
RemyA
On 6 June 2012 at 19:14, Shameema Umer wrote:
I have added the feed plugin to the