RE: Linkdb empty

2012-06-06 Thread Markus Jelsma
-Original message- From: Matthias Paul magethle.nu...@gmail.com Sent: Wed 06-Jun-2012 09:47 To: user@nutch.apache.org Subject: Linkdb empty Hi all, I noticed that my linkdb is always empty although I use the generated segments from the last crawl for the generation of the

RE: Nutch topN selection

2012-06-06 Thread Markus Jelsma
-Original message- From: chethan chethan.p...@gmail.com Sent: Wed 06-Jun-2012 05:12 To: user@nutch.apache.org Subject: Nutch topN selection Hi, Does the topN threshold consider page score for the selection? If it's set to, say, 10, does Nutch queue up the 10 top scoring URLs

RE: Behaviour of urlfilter-suffix plug-in when dealing with a URL without filename extension

2012-06-06 Thread Markus Jelsma
-Original message- From: Andy Xue andyxuey...@gmail.com Sent: Wed 06-Jun-2012 05:04 To: user@nutch.apache.org Subject: Behaviour of "urlfilter-suffix" plug-in when dealing with a URL without filename extension Hi all: Does the urlfilter-suffix plug-in prune URL

RE: threads disminution when fetching page

2012-06-06 Thread Markus Jelsma
-Original message- From:pepe3059 pepe3...@gmail.com Sent: Wed 06-Jun-2012 02:58 To: user@nutch.apache.org Subject: RE: threads disminution when fetching page me again :) at the end of fetch process, is the regex-urlfilter considered? No. At the end of the fetch the mapper

Re: Nutch topN selection

2012-06-06 Thread chethan
Thanks -Chethan On Wed, Jun 6, 2012 at 1:34 PM, Markus Jelsma markus.jel...@openindex.io wrote: -Original message- From: chethan chethan.p...@gmail.com Sent: Wed 06-Jun-2012 05:12 To: user@nutch.apache.org Subject: Nutch topN selection Hi, Does the topN threshold

Re: Behaviour of urlfilter-suffix plug-in when dealing with a URL without filename extension

2012-06-06 Thread Andy Xue
Hi Markus: Thanks for the reply and information provided. I did a quick test by: 1. adding urlfilter-suffix in plugin.includes property in nutch-site.xml 2. running runtime/local/bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.suffix.SuffixURLFilter Here

RE: Behaviour of urlfilter-suffix plug-in when dealing with a URL without filename extension

2012-06-06 Thread Markus Jelsma
-Original message- From: Andy Xue andyxuey...@gmail.com Sent: Wed 06-Jun-2012 11:11 To: Markus Jelsma markus.jel...@openindex.io; user@nutch.apache.org Subject: Re: Behaviour of "urlfilter-suffix" plug-in when dealing with a URL without filename extension Hi Markus:

Re: Linkdb empty

2012-06-06 Thread Matthias Paul
Both db.ignore.internal.links and db.ignore.external.links are true in my case. Since I crawl only one domain, I suppose setting db.ignore.external.links to true is a good idea. So db.ignore.internal.links should be false? From what I understand db.ignore.external.links is a setting for the
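
For reference, the interaction of the two flags discussed in this message can be modelled roughly as below. This is only a sketch in Python, not Nutch's own code: the helper names `keep_link` and `surviving` are hypothetical, the example URLs are made up, and "internal" is assumed to mean "same hostname". Under those assumptions, a single-domain crawl with both flags set to true leaves no links to record.

```python
from urllib.parse import urlparse


def keep_link(from_url, to_url, ignore_internal, ignore_external):
    """Simplified model: a link is internal when both URLs share a hostname."""
    same_host = urlparse(from_url).hostname == urlparse(to_url).hostname
    if same_host and ignore_internal:
        return False  # internal link dropped
    if not same_host and ignore_external:
        return False  # external link dropped
    return True


def surviving(links, ignore_internal, ignore_external):
    """Return the (from, to) pairs that are kept under the given flags."""
    return [l for l in links if keep_link(*l, ignore_internal, ignore_external)]


# Hypothetical links found while crawling a single domain.
links = [
    ("http://example.com/a", "http://example.com/b"),       # internal
    ("http://example.com/a", "http://other.example.org/"),  # external
]

both_true = surviving(links, ignore_internal=True, ignore_external=True)
internal_kept = surviving(links, ignore_internal=False, ignore_external=True)
```

With both flags true, `both_true` is empty; with only the external flag true, the internal link survives.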

RE: How to write complex rules on regex-urlfilter

2012-06-06 Thread Markus Jelsma
What's the problem with having the seed page? Can you not inject only the /news pages? Anyway, you can always filter it away later after the first fetch cycle. -Original message- From: Shameema Umer shem...@gmail.com Sent: Wed 06-Jun-2012 13:02 To: user@nutch.apache.org Subject:

Re: How to write complex rules on regex-urlfilter

2012-06-06 Thread Shameema Umer
I really need to fetch news from a set of domains. But most of my domains have news links like this: www.mydomain.com/article/ http://www.mydomain.com/news/ werwer-wefewf-wfefef-fregd/ and the page www.mydomain.com/article/ http://www.mydomain.com/news/ does not exist. So, I am forced to give site
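
As an illustration of the kind of rule set this thread is about, here is a small Python sketch of the first-match-wins evaluation used by regex-urlfilter: each rule is a `+` or `-` sign plus a regular expression, the first rule whose pattern is found in the URL decides, and a URL matching no rule is rejected. The rules themselves are hypothetical (they are not taken from the poster's configuration), and `accept` is an illustrative helper, not a Nutch API.

```python
import re

# Hypothetical rules, evaluated top to bottom; the first match decides.
RULES = [
    ("+", re.compile(r"^http://www\.mydomain\.com/news/.+")),  # article pages
    ("-", re.compile(r".")),                                   # everything else
]


def accept(url, rules=RULES):
    """Return True if the first matching rule is a '+' rule."""
    for sign, pattern in rules:
        if pattern.search(url):  # "find" semantics: match anywhere in the URL
            return sign == "+"
    return False  # no rule matched: reject


article_ok = accept("http://www.mydomain.com/news/werwer-wefewf-wfefef-fregd/")
listing_ok = accept("http://www.mydomain.com/news/")
```

Under these rules the bare /news/ listing URL is rejected because the first pattern requires at least one character after /news/, while article URLs beneath it pass.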

HTTP REFERER is missing

2012-06-06 Thread SebaZ
I have successfully implemented Nutch as a crawler for a Solr index on the http://szukaj.ug.edu.pl site. But there is a problem with HTTP REFERER: Nutch is not sending the Referer header when crawling sites. Is it possible to make Nutch send the Referer header on request?

RE: HTTP REFERER is missing

2012-06-06 Thread Markus Jelsma
Hi, Nutch cannot do this by default, and it is tricky to implement because there may not be one unique referrer per page. What you can try is to add the referrer to outlinks when parsing records. This outlink can be added to CrawlDatum's MetaData, which you can then later use to set the referrer. To set

how to crawl a specific time

2012-06-06 Thread Ing. Eyeris Rodriguez Rueda
Hi all. I need to configure Nutch to crawl for a specific amount of time, for example 5 minutes or 1 hour. Does anybody know how to limit the crawl process to a specific time?

Re: can nutch crawl links in rss feed?

2012-06-06 Thread Rémy Amouroux
First problem coming to mind: is your regex-urlfilter accepting those URLs? You should also do a readseg on the crawled segment to see if those URLs are listed in the outlinks of the feeds. Regards RemyA On 6 June 2012 at 19:14, Shameema Umer wrote: I have added the feed plugin to the