getting reports from nutch

2012-06-22 Thread kaveh minooie
Hello list, does anybody know of an application that could generate better reports (such as the number of pages fetched) than what is available on the jobtracker site?

RE: getting reports from nutch

2012-06-22 Thread Markus Jelsma
Hi, You can use the domainstats tool to generate counts for domain, host, suffix and TLD. There's also the readdb -stats tool that shows your overall statistics. NUTCH-1325 provides the same as readdb -stats but for individual hosts. Cheers -Original message- From:kaveh minooie

Re: HTTP REFERER is missing

2012-06-22 Thread SebaZ
What code do you mean? As I wrote earlier, I'm not good at Java programming, and I am just using Nutch as an application: install and use, not recode. Julien Nioche-4 wrote: You can write a custom scoringfilter to track the URL of the source, see the one in urlmeta for an example. It should be

Deprecated properties in nutch-default.xml

2012-06-22 Thread Matthias Paul
Hi, in nutch-site.xml the description for the property fetcher.server.min.delay still mentions fetcher.threads.per.host. Shouldn't this be replaced with fetcher.threads.per.queue? Are there any plans to remove deprecated properties from nutch-default.xml? For example generate.max.per.host and

Re: getting reports from nutch

2012-06-22 Thread Lewis John Mcgibbney
In addition to this, there's also a benchmarking tool written a while back. Although this doesn't get you reports as such, it enables you to gauge the efficiency of your crawls: http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/tools/Benchmark.java hth On Fri, Jun 22, 2012 at

Re: Nutch as mirroring tool

2012-06-22 Thread Vlad Paunescu
Hi Lewis, Thank you very much for your quick reply. I want to get an expert opinion on whether Nutch would be the appropriate tool for what I want to accomplish. In my team, opinions are somewhat divided: some want to use Nutch for this, while others recommend

Re: Deprecated properties in nutch-default.xml

2012-06-22 Thread Lewis John Mcgibbney
Hi Matthias, Can you please open a ticket for the issues you highlight? If you could provide a patch against trunk it would be great for us to review. Thanks very much Lewis On Fri, Jun 22, 2012 at 9:03 AM, Matthias Paul magethle.nu...@gmail.com wrote: Hi, in nutch-site.xml the description

RE: robots.txt, disallow: with empty string

2012-06-22 Thread Markus Jelsma
I tried debugging your problem but it doesn't seem to exist. I fixed Nutch's RobotParser test [1] but I cannot confirm URLs being disallowed if there is no value for Disallow: in the robots.txt file. https://issues.apache.org/jira/browse/NUTCH-1408 Test with: $ bin/nutch plugin lib-http
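
For readers unfamiliar with the rule under discussion: an empty Disallow: value conventionally means that nothing is disallowed. The sketch below illustrates that convention with Python's standard urllib.robotparser, used here only as a stand-in; it is not Nutch's RobotParser and does not reproduce the lib-http test above. The helper name is_allowed and the example.com URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, agent="*"):
    """Parse robots.txt lines and report whether `agent` may fetch `url`.

    Hypothetical helper for illustration; not part of Nutch.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


# An empty Disallow: value places no restriction on the crawler.
empty_disallow = ["User-agent: *", "Disallow:"]

# For contrast, "Disallow: /" blocks every path on the host.
block_all = ["User-agent: *", "Disallow: /"]

if __name__ == "__main__":
    page = "http://example.com/page.html"  # placeholder URL
    print("empty Disallow ->", is_allowed(empty_disallow, page))
    print("Disallow: /    ->", is_allowed(block_all, page))
```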

Odd results from nutch-crawl (1.4), and request for inlink command

2012-06-22 Thread Joshua J Pavel
So, during my crawl I get entries like this in the crawl log as a result of the parsing: http://www.domain.comhttp/www.domain.com/news/articles/2012-03-11/201205101336665761902.html http://www.domain.comhttp/www.domain.com/news/articles/2012-04-24/201205101336663435768.html The fetches fail,

RE: Odd results from nutch-crawl (1.4), and request for inlink command

2012-06-22 Thread Markus Jelsma
Hi, If Nutch finds a relative URL it will be converted to absolute. This means that any URL that does not explicitly start with http:// is going to have the host prefixed. Your domain.com pages produce bad URLs such as http/www. And since this is not http://, it'll end up as
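
The effect described in this reply can be sketched with Python's urllib.parse.urljoin, which follows the generic rules for resolving a link against the page it was found on. This is an illustration of the general mechanism only, not Nutch's own URL handling, and the resulting string need not match the crawl log exactly. The base page path /news/index.html and the shortened article path are assumptions.

```python
from urllib.parse import urljoin


def resolve(base, href):
    """Resolve a link found on page `base` into an absolute URL."""
    return urljoin(base, href)


# Hypothetical page on which the links were found.
base = "http://www.domain.com/news/index.html"

# A well-formed absolute link is kept as it is.
good = resolve(base, "http://www.domain.com/news/a.html")

# Without the "://" the link has no scheme, so it is treated as a relative
# path and appended to the base page's directory.
bad = resolve(base, "http/www.domain.com/news/a.html")

if __name__ == "__main__":
    print(good)
    print(bad)
```

The malformed link keeps the host of the page it came from and gains a spurious http/www.domain.com path segment, which is why such fetches fail.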

Re: Deprecated properties in nutch-default.xml

2012-06-22 Thread Lewis John Mcgibbney
Excellent Matthias thank you Lewis On Fri, Jun 22, 2012 at 3:21 PM, Matthias Paul magethle.nu...@gmail.com wrote: I opened NUTCH-1409 with a patch attached. On Fri, Jun 22, 2012 at 11:37 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Matthias, Can you please open a ticket

Re: Parser choking on irregular url

2012-06-22 Thread Lewis John Mcgibbney
Excellent On Fri, Jun 22, 2012 at 12:17 AM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Lewis, You got fooled by the ampersand switch on Unix terminals that sends a command to the background. The [] integers are Unix process IDs of the commands you have given. $ abc is not one

Document boosting during indexing

2012-06-22 Thread parnab kumar
Hi, I need to boost certain documents at indexing time. I am using Nutch 1.1. I am not using Solr. How do I do it? Please help. Thanks, Parnab

RE: Document boosting during indexing

2012-06-22 Thread Markus Jelsma
I am not sure but the boost field may be available. I think it was populated with the document score but you could increase it with a custom filter or some hacking around. -Original message- From:parnab kumar parnab.2...@gmail.com Sent: Fri 22-Jun-2012 17:36 To: