hello list
does anybody know of an application, or anything else, that could
generate better reports (like number of pages fetched, etc.) than what is
available on the jobtracker site?
Hi,
You can use the domainstats tool to generate counts per domain, host, suffix
and tld. There's also the readdb -stats tool that shows your overall
statistics. NUTCH-1325 provides the same as readdb -stats but for individual
hosts.
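For reference, both tools are run through the bin/nutch script. The crawldb path below is just an example, and the exact domainstats argument order can differ between Nutch versions, so treat this as a sketch rather than the definitive invocation:

```shell
# Overall crawldb statistics: URL counts per status, score min/max/avg
bin/nutch readdb crawl/crawldb -stats

# Per-host (or domain/suffix/tld) counts via the domainstats tool;
# run it with no arguments to see the usage line for your version
bin/nutch domainstats crawl/crawldb/current stats_out host
```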
Cheers
-Original message-
From: kaveh minooie
What code do you mean? As I wrote earlier, I'm not good at Java programming,
and am just using Nutch as an application: install + use, not recode.
Julien Nioche-4 wrote
You can write a custom scoringfilter to track the URL of the source,
see
the one in urlmeta for an example. It should be
Hi,
in nutch-site.xml the description for the property
fetcher.server.min.delay still mentions fetcher.threads.per.host.
Shouldn't this be replaced with fetcher.threads.per.queue?
Are there any plans to remove deprecated properties from nutch-default.xml?
For example generate.max.per.host and
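If a ticket gets opened, the fix would presumably just swap the stale property name inside the description, roughly like this (the wording below is a suggestion based on the shipped description, not committed text):

```xml
<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
  <description>The minimum number of seconds the fetcher will delay between
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
  is turned off).</description>
</property>
```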
In addition to this, there's also a benchmarking tool written a while back.
Although this doesn't get you reports as such, it enables you to gauge
the efficiency of your crawls:
http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/tools/Benchmark.java
hth
On Fri, Jun 22, 2012 at
Hi Lewis,
Thank you very much for your quick reply.
I want to get an expert opinion on whether Nutch would be the appropriate tool
for what I want to accomplish, or not. In my team, the opinions are a
little bit divergent: some want to use Nutch for this, but on the opposite
side, some recommend
Hi Matthias,
Can you please open a ticket for the issues you highlight.
If you could provide a patch against trunk it would be great for us to
review. Thanks very much
Lewis
On Fri, Jun 22, 2012 at 9:03 AM, Matthias Paul magethle.nu...@gmail.com wrote:
Hi,
in nutch-site.xml the description
I tried debugging your problem but it doesn't seem to exist. I fixed Nutch's
RobotParser test [1] but I cannot confirm URLs being disallowed if there is NO
value for Disallow: in the robots.txt file.
[1] https://issues.apache.org/jira/browse/NUTCH-1408
Test with:
$ bin/nutch plugin lib-http
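For context: under the robots exclusion convention, a Disallow line with an empty value allows everything, so a robots.txt like the following must not cause any URL to be disallowed (which is what the fixed test asserts):

```
User-agent: *
Disallow:
```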
So, during my crawl I get entries like this in the crawl log as a result of
the parsing:
http://www.domain.comhttp/www.domain.com/news/articles/2012-03-11/201205101336665761902.html
http://www.domain.comhttp/www.domain.com/news/articles/2012-04-24/201205101336663435768.html
The fetches fail,
Hi,
If Nutch finds a relative URL it will be converted to absolute. This means that
any URL that does not explicitly start with http:// is going to have the host
prefixed. Your domain.com pages produce bad URLs such as http/www. And since
this is not http://, it'll end up as
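The effect can be reproduced with plain java.net.URL (a sketch; Nutch's own resolver may differ in details such as whether it inserts a slash, but the mangled host-plus-path outcome is the same). A link written as http/www.domain.com/... has no scheme, so it is resolved as a relative path against the page's base URL:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RelativeUrlDemo {
    public static void main(String[] args) throws MalformedURLException {
        // Base URL of the page containing the bad link (example value)
        URL base = new URL("http://www.domain.com/news/index.html");

        // The author meant "http://www.domain.com/..." but dropped the "://",
        // so the link is scheme-less and therefore treated as relative.
        URL resolved = new URL(base, "http/www.domain.com/news/articles/a.html");

        System.out.println(resolved);
        // -> http://www.domain.com/news/http/www.domain.com/news/articles/a.html
    }
}
```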
Excellent Matthias thank you
Lewis
On Fri, Jun 22, 2012 at 3:21 PM, Matthias Paul magethle.nu...@gmail.com wrote:
I opened NUTCH-1409 with a patch attached.
On Fri, Jun 22, 2012 at 11:37 AM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi Matthias,
Can you please open a ticket
Excellent
On Fri, Jun 22, 2012 at 12:17 AM, Markus Jelsma
markus.jel...@openindex.io wrote:
Hi Lewis,
You got fooled by the ampersand switch on Unix terminals, which sends a command
to the background. The [] integers are Unix process IDs of the commands you
have given.
$ abc is not one
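For anyone else puzzled by this: a trailing & backgrounds the command, and an interactive shell replies with the job number in brackets plus the process ID (the PID value will differ on your machine):

```shell
sleep 30 &        # '&' sends the command to the background
# the shell replies with something like: [1] 12345
jobs              # list background jobs for this shell
wait %1           # block until job 1 finishes
```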
Hi,
I need to boost certain documents at indexing time. I am using Nutch
1.1. I am not using SOLR. How do I do it? Please help.
Thanks,
Parnab
I am not sure but the boost field may be available. I think it was populated
with the document score but you could increase it with a custom filter or some
hacking around.
-Original message-
From: parnab kumar parnab.2...@gmail.com
Sent: Fri 22-Jun-2012 17:36
To: