[jira] Reopened: (NUTCH-309) Uses commons logging Code Guards
[ http://issues.apache.org/jira/browse/NUTCH-309?page=all ]

Doug Cutting reopened NUTCH-309:
--------------------------------

I am re-opening this issue, as the guards were added in far too many places. Jerome, can you please fix these so that guards are only added when (a) the log level is DEBUG or TRACE, (b) the guard occurs in performance-critical code, and (c) the logged string is not constant?

> Uses commons logging Code Guards
> --------------------------------
>
>          Key: NUTCH-309
>          URL: http://issues.apache.org/jira/browse/NUTCH-309
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Jerome Charron
>     Assignee: Jerome Charron
>     Priority: Minor
>      Fix For: 0.8-dev
>
> Code guards are typically used to guard code that only needs to execute in support of logging, but that otherwise introduces undesirable runtime overhead in the general case (logging disabled). Examples are multiple parameters, or expressions (e.g. string + more) for parameters.
> Use the guard methods of the form log.isPriority() to verify that logging should be performed before incurring the overhead of the logging method call. Yes, the logging methods will perform the same check, but only after resolving parameters.
> (description extracted from http://jakarta.apache.org/commons/logging/guide.html#Code_Guards)

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
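The guard pattern the description refers to can be sketched with JDK-only logging (commons-logging's LOG.isDebugEnabled() guard works the same way; the logger name and message below are illustrative, not from Nutch):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardDemo {

    static int buildCount = 0;

    // Stands in for an expensive log message: concatenation, toString()
    // calls, etc. -- work we only want to pay for when it will be logged.
    static String expensiveMessage() {
        buildCount++;
        return "fetched " + 42 + " pages in " + 1.5 + "s";
    }

    // Returns how many times the message was actually built.
    static int runDemo() {
        buildCount = 0;
        Logger log = Logger.getLogger("guard.demo");
        log.setLevel(Level.INFO); // FINE (the JDK's DEBUG analogue) is disabled

        // Unguarded: the argument is evaluated even though nothing is logged.
        log.fine(expensiveMessage());

        // Guarded: the cheap check skips message construction entirely.
        if (log.isLoggable(Level.FINE)) {
            log.fine(expensiveMessage());
        }
        return buildCount;
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

The unguarded call still pays for the string concatenation before the logger discards it; the guarded call pays only for the level check, which is the whole point of the guard (and why constant strings do not need one).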
[jira] Commented: (NUTCH-309) Uses commons logging Code Guards
[ http://issues.apache.org/jira/browse/NUTCH-309?page=comments#action_12419670 ]

Jerome Charron commented on NUTCH-309:
--------------------------------------

As already discussed, it makes perfect sense, and I have planned to work on this issue. Another minor change I would like to make is to replace log4j.properties with log4j.xml. The log4j.xml format provides more functionality and flexibility, especially filters, which provide a way to log to different appenders depending on the log level (for instance, I use this to log all levels to a file and the WARN and ERROR levels to the console).
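As a hedged illustration of the filter feature Jerome describes (the appender names, file path, and patterns below are placeholders, not Nutch's actual configuration), a log4j 1.x log4j.xml routing all levels to a file and only WARN and above to the console via a LevelRangeFilter might look like:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">

  <!-- All levels go to a file. -->
  <appender name="file" class="org.apache.log4j.FileAppender">
    <param name="File" value="logs/nutch.log"/>
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d %-5p %c - %m%n"/>
    </layout>
  </appender>

  <!-- Only WARN and above reach the console, thanks to the filter
       (filters like this are what log4j.properties cannot express). -->
  <appender name="console" class="org.apache.log4j.ConsoleAppender">
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%-5p %c - %m%n"/>
    </layout>
    <filter class="org.apache.log4j.varia.LevelRangeFilter">
      <param name="LevelMin" value="WARN"/>
      <param name="LevelMax" value="FATAL"/>
    </filter>
  </appender>

  <root>
    <priority value="debug"/>
    <appender-ref ref="file"/>
    <appender-ref ref="console"/>
  </root>
</log4j:configuration>
```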
[jira] Commented: (NUTCH-300) Clustering API improvements
[ http://issues.apache.org/jira/browse/NUTCH-300?page=comments#action_12419675 ]

nutch.newbie commented on NUTCH-300:
------------------------------------

Has anyone tried this patch? Anyone from the Carrot2 team? Is it compatible with the current version?

> Clustering API improvements
> ---------------------------
>
>          Key: NUTCH-300
>          URL: http://issues.apache.org/jira/browse/NUTCH-300
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Andrzej Bialecki
>     Priority: Minor
>  Attachments: patch.txt
>
> This patch adds support for retrieving original document scores (from NutchBean), as well as cluster-level relevance scores (from Clusterer). Both methods may improve the visual representation of the clusters, where individual items may be visually differentiated depending on their query relevance and cluster relevance. A modified cluster.jsp illustrates this feature.
[jira] Commented: (NUTCH-300) Clustering API improvements
[ http://issues.apache.org/jira/browse/NUTCH-300?page=comments#action_12419708 ]

Dawid Weiss commented on NUTCH-300:
-----------------------------------

Hi. I just took a look at it -- I don't see anything wrong with the code, and Andrzej has used Carrot2 before. We're in the middle of major refactorings to simplify things within Carrot2 -- the internals won't change much, but we are dropping obsolete APIs, etc. The new web application has a shiny new user interface (at the moment XSLT-filtered from XML, so not applicable for huge user loads, but very convenient to work with for customizations). Stay tuned.
Re: Error with Hadoop-0.4.0
Hi Jérôme,

I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug. We should fix that.

Stefan

On 06.07.2006, at 08:54, Jérôme Charron wrote:

> Hi,
>
> I encountered some problems with the Nutch trunk version. In fact they seem to be related to the Hadoop-0.4.0 changes and JDK 1.5 (more precisely, since HADOOP-129 and the replacement of File by Path). In my environment, the crawl command terminates with the following error:
>
> 2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
> Exception in thread "main" java.io.IOException: Input directory /localpath/crawl/crawldb/current in local is invalid.
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>
> By looking at the Nutch code, and simply changing line 145 of Injector to mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)), everything works fine. By taking a closer look at the CrawlDb code, I finally don't understand why the following line is in the createJob method:
>
> job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
>
> Out of curiosity, can a Hadoop guru explain why there is such a regression? Does somebody else have the same error?
>
> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
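The difference between the two calls can be sketched with a hypothetical stand-in (the path strings and the List model of the job's input-path handling are illustrative assumptions, not Hadoop's actual implementation): addInputPath appends to the job's registered input paths, while setInputPath replaces them, which is why the stale path from createJob survives the add but not the set.

```java
import java.util.ArrayList;
import java.util.List;

public class InputPathDemo {

    // Hypothetical model of a job's input-path list
    // (JobConf keeps something similar internally).
    static List<String> paths = new ArrayList<>();

    // Appends: any path already registered stays in place.
    static void addInputPath(String p) {
        paths.add(p);
    }

    // Replaces: clears previously registered paths first.
    static void setInputPath(String p) {
        paths.clear();
        paths.add(p);
    }

    public static void main(String[] args) {
        // With add: the path from createJob lingers next to the temp dir,
        // and the job trips over the not-yet-existing "current" directory.
        addInputPath("crawl/crawldb/current"); // set inside createJob
        addInputPath("crawl/crawldb/tmp");     // appended by Injector
        System.out.println(paths);

        // With set: only the temp dir remains -- the fix described above.
        paths.clear();
        addInputPath("crawl/crawldb/current");
        setInputPath("crawl/crawldb/tmp");
        System.out.println(paths);
    }
}
```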
Number of pages different to Indexed documents
Hi all!

I have a little doubt. My WebDB currently contains 779 pages with 899 links, and the segread command also reports 779 pages in one segment. However, when I make a search, or when I use the Luke software, the maximum number of documents is 437.

I've looked at the recrawl logs, and while the script is fetching pages, some of them produce the message:

... failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.

I think this happens because of some network problem: the fetcher tries to fetch a page but does not succeed. Because of this, when the segment is indexed, only the successfully fetched pages appear in the results. This is a problem for me. Could someone explain what I should do to refetch these pages and increase my web search results? Should I change the http.max.delays and fetcher.server.delay properties in nutch-default.xml?

Regards,

--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
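One way to experiment with the two properties mentioned above (the values here are illustrative guesses, not recommendations; overrides conventionally go in nutch-site.xml rather than editing nutch-default.xml) would be:

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: per-site overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>http.max.delays</name>
    <!-- Illustrative value: how many times the fetcher will defer a page
         from a busy host before giving up with RetryLater. -->
    <value>100</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <!-- Illustrative value: seconds to wait between successive requests
         to the same host. -->
    <value>1.0</value>
  </property>
</configuration>
```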
RE: 0.8 release
May I suggest someone take a look at NUTCH-266 before releasing 0.8? The Nutch build as of half a month ago was not working for me and another person.

-kuro

-----Original Message-----
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: 2006-7-05 11:53
To: nutch-dev@lucene.apache.org
Subject: Re: 0.8 release

+1, but I really would love to see NUTCH-293 as part of Nutch 0.8, since this is all about being more polite.
Thanks.
Stefan

On 05.07.2006, at 03:46, Doug Cutting wrote:

> +1
>
> Piotr Kosiorowski wrote:
>> +1. P.
>>
>> Andrzej Bialecki wrote:
>>> Sami Siren wrote:
>>>> How would folks feel about releasing 0.8 now? There have been quite a lot of improvements/new features since the 0.7 series, and I strongly feel that we should push the first 0.8 series release (alpha/beta) out the door now. It would IMO lower the barrier for first-timers to try the 0.8 series, and that would give us more feedback about the overall quality.
>>>
>>> Definitely +1. Let's do some testing, however, after the upgrade to Hadoop 0.3.2 -- Hadoop had many, many changes, so we just need to make sure it's stable when used with Nutch... We should also check JIRA and apply any trivial fixes before the release.
>>>
>>>> If there is a consensus about this I can volunteer to be the RM.
>>>
>>> That would be great, thanks!
Re: Error with Hadoop-0.4.0
> I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug.

Thanks for this feedback Stefan.

> We should fix that.

What I suggest is simply to remove line 75 in the createJob method of CrawlDb:

setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));

In fact, this method is only used by Injector.inject() and CrawlDb.update(), and the input path set in createJob is needed by neither Injector.inject() nor CrawlDb.update(). If there is no objection, I will commit this change tomorrow.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: Error with Hadoop-0.4.0
We tried your suggested fix -- changing Injector to mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir)) -- and this worked without any problem. Thanks for catching that; this saved us a lot of time.

Stefan

On 07.07.2006, at 16:08, Jérôme Charron wrote:

> > I have the same problem in a distributed environment! :-( So I think I can confirm this is a bug.
>
> Thanks for this feedback Stefan.
>
> > We should fix that.
>
> What I suggest is simply to remove line 75 in the createJob method of CrawlDb:
>
> setInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
>
> In fact, this method is only used by Injector.inject() and CrawlDb.update(), and the input path set in createJob is needed by neither Injector.inject() nor CrawlDb.update(). If there is no objection, I will commit this change tomorrow.
>
> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/