[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12373256 ]

Shawn Gervais commented on NUTCH-240:
-------------------------------------

This change seems to have caused an error to be thrown:

060405 034711 Generator: Partitioning selected urls by host, for politeness.
Exception in thread "main" java.lang.RuntimeException: class org.apache.nutch.crawl.Generator$SelectorInverseMapper not org.apache.hadoop.mapred.Mapper
        at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:262)
        at org.apache.hadoop.mapred.JobConf.setMapperClass(JobConf.java:249)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:263)
        at org.apache.nutch.crawl.Generator.main(Generator.java:317)

Just FYI.

Scoring API: extension point, scoring filters and an OPIC plugin

         Key: NUTCH-240
         URL: http://issues.apache.org/jira/browse/NUTCH-240
     Project: Nutch
        Type: Improvement
    Versions: 0.8-dev
    Reporter: Andrzej Bialecki
    Assignee: Andrzej Bialecki
 Attachments: Generator.patch.txt, patch.txt, patch1.txt

This patch refactors all places where Nutch manipulates page scores into a plugin-based API. Using this API it's possible to implement different scoring algorithms. It is also much easier to understand how scoring works.

Multiple scoring plugins can be run in sequence, in a manner similar to URLFilters.

Also included is an OPICScoringFilter plugin, which contains the current implementation of the scoring algorithm. Together with the scoring API it provides fully backward-compatible scoring.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
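The issue description says that several scoring plugins can be run in sequence, much like URLFilters. The chaining idea can be sketched roughly as follows. The `ScoreFilter` interface, its `adjust` method and the `runChain` helper are hypothetical names made up for this illustration; they are not the extension point defined in the NUTCH-240 patch.

```java
import java.util.Arrays;
import java.util.List;

final class ScoringChainSketch {

    /** Hypothetical stand-in for a scoring plugin: takes a score, returns an adjusted one. */
    interface ScoreFilter {
        float adjust(String url, float score);
    }

    /** Runs every filter in order; each one sees the score produced by the previous one. */
    static float runChain(List<ScoreFilter> filters, String url, float initialScore) {
        float score = initialScore;
        for (ScoreFilter filter : filters) {
            score = filter.adjust(url, score);
        }
        return score;
    }

    public static void main(String[] args) {
        // Two toy filters: one damps the score, the next adds a constant boost.
        List<ScoreFilter> chain = Arrays.<ScoreFilter>asList(
                (url, score) -> score * 0.5f,
                (url, score) -> score + 1.0f);
        System.out.println(runChain(chain, "http://example.org/", 2.0f));
    }
}
```

Because each filter receives the previous filter's output, the order in which plugins are configured matters in a design like this one.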
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12373264 ]

Andrzej Bialecki commented on NUTCH-240:
----------------------------------------

Oops, sorry, that was a last-moment change... I fixed it now, thanks for spotting this.
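For readers puzzled by the exception reported in this thread: a message of the form "class X not Y" is what an interface check on a configured class would typically produce. The sketch below reproduces that shape with plain Java reflection. It is only an illustration: `requireImplements`, `Mapper`, `GoodMapper` and `NotAMapper` are hypothetical names, and whether Hadoop's `Configuration.setClass` is written this way is an assumption.

```java
final class InterfaceCheckSketch {

    /** Hypothetical stand-in for the interface a job expects its mapper to implement. */
    interface Mapper { }

    static final class GoodMapper implements Mapper { }

    /** A class that was meant to be a mapper but does not declare the interface. */
    static final class NotAMapper { }

    /** Rejects any class that does not implement (or extend) the expected type. */
    static void requireImplements(Class<?> theClass, Class<?> expected) {
        if (!expected.isAssignableFrom(theClass)) {
            throw new RuntimeException(theClass + " not " + expected.getName());
        }
    }

    public static void main(String[] args) {
        requireImplements(GoodMapper.class, Mapper.class); // passes silently
        try {
            requireImplements(NotAMapper.class, Mapper.class);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under that assumption, the error would mean that the inner class handed to the job no longer declared the expected mapper interface after the last-moment change.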
[jira] Commented: (NUTCH-244) Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite
[ http://issues.apache.org/jira/browse/NUTCH-244?page=comments#action_12373393 ]

Jerome Charron commented on NUTCH-244:
--------------------------------------

While taking a quick look at this, something astonished me in the code. The db.max.outlinks.per.page property is used exclusively in ParseData. In ParseData, the number of outlinks is limited in the readFields method... Shouldn't it be limited directly in the ParseData constructor?

Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite

         Key: NUTCH-244
         URL: http://issues.apache.org/jira/browse/NUTCH-244
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev
    Reporter: AJ Banck

Some properties like file.content.limit support using negative numbers (-1) to 'disable' a limitation. Other properties do not support this.

I tried disabling the limit set by db.max.outlinks.per.page, but this isn't possible.
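The convention the reporter asks for (a negative value such as -1 meaning "no limit") can be sketched as below. This is a minimal illustration under stated assumptions, not the ParseData code: `effectiveCount` and `truncate` are hypothetical helpers, and the limit is passed in as a plain int rather than read from db.max.outlinks.per.page.

```java
import java.util.Arrays;

final class OutlinkLimitSketch {

    /**
     * Returns how many of the available outlinks to keep.
     * A negative limit is treated as "unlimited"; zero keeps nothing.
     */
    static int effectiveCount(int maxPerPage, int available) {
        if (available < 0) {
            throw new IllegalArgumentException("available must be >= 0");
        }
        return maxPerPage < 0 ? available : Math.min(maxPerPage, available);
    }

    /** Keeps at most the configured number of outlinks, preserving their order. */
    static String[] truncate(String[] outlinks, int maxPerPage) {
        return Arrays.copyOf(outlinks, effectiveCount(maxPerPage, outlinks.length));
    }

    public static void main(String[] args) {
        String[] links = {"http://a.example/", "http://b.example/", "http://c.example/"};
        System.out.println(truncate(links, 2).length);
        System.out.println(truncate(links, -1).length);
    }
}
```

Putting the rule in one helper means it behaves the same wherever the limit is applied, whether that ends up being the constructor or readFields.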
[jira] Commented: (NUTCH-244) Inconsistent handling of property values boundaries / unable to set db.max.outlinks.per.page to infinite
[ http://issues.apache.org/jira/browse/NUTCH-244?page=comments#action_12373398 ]

Jerome Charron commented on NUTCH-244:
--------------------------------------

That makes perfect sense! Thanks Andrzej.
Re: Patch to fix Redirects
Dennis Kubes wrote:
> Attached is a patch to fix redirects. In the current version of 0.8-dev the redirect functionality wasn't working because it was using the original key value (original URL) to get the output instead of the refresh URL. This is the first patch that I have submitted, so if this needs to be submitted differently please let me know.

Fixed. Thank you! Please note that this does NOT fix the content-level redirects (i.e. meta-refresh); they are still broken.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Search quality evaluation
FYI, Mike wrote some evaluation stuff for Nutch a long time ago. I found it in the Sourceforge Attic:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/quality/Attic/

This worked by querying a set of search engines, those in:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/engines/

The results of each engine are scored by how much they differ from those of all the other engines combined. The Kendall Tau distance is used to compare rankings. Thus this is a good tool to find out how close Nutch is to the quality of other engines, but it may not be a good tool to make Nutch better than other search engines.

In any case, it includes a system to scrape search results from other engines, based on Apple's Sherlock search-engine descriptors. These descriptors are also used by Mozilla:

http://mycroft.mozdev.org/deepdocs/quickstart.html

So there's a ready supply of up-to-date descriptions for most major search engines. Many engines provide a skin specifically to simplify parsing by these plugins.

The code that implemented Sherlock plugins in Nutch is at:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/quality/dynamic/

Doug

Andrzej Bialecki wrote:
> Hi,
>
> I found this paper, more or less by accident:
>
> Scaling IR-System Evaluation using Term Relevance Sets; Einat Amitay, David Carmel, Ronny Lempel, Aya Soffer
>
> http://einat.webir.org/SIGIR_2004_Trels_p10-amitay.pdf
>
> It gives an interesting and rather simple framework for evaluating the quality of search results. Anybody interested in hacking together a component for Nutch and e.g. for Google, to run this evaluation? ;)
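Doug's reply above mentions that the old evaluation code compared rankings with the Kendall Tau distance. For readers unfamiliar with it, here is a minimal sketch of the textbook definition: the number of item pairs that two rankings order differently. It assumes both lists contain exactly the same items with no ties, which real result lists from different engines usually do not, so the code in the Attic may well have handled that differently. The class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class KendallTauSketch {

    /** Counts the pairs of items that the two rankings put in opposite order. */
    static int distance(List<String> first, List<String> second) {
        if (first.size() != second.size()) {
            throw new IllegalArgumentException("rankings must have the same length");
        }
        // Position of every item in the second ranking, for constant-time lookup.
        Map<String, Integer> positionInSecond = new HashMap<>();
        for (int i = 0; i < second.size(); i++) {
            positionInSecond.put(second.get(i), i);
        }
        int discordant = 0;
        for (int i = 0; i < first.size(); i++) {
            for (int j = i + 1; j < first.size(); j++) {
                Integer pi = positionInSecond.get(first.get(i));
                Integer pj = positionInSecond.get(first.get(j));
                if (pi == null || pj == null) {
                    throw new IllegalArgumentException("rankings must contain the same items");
                }
                // first ranks item i before item j; count the pair if second disagrees.
                if (pi > pj) {
                    discordant++;
                }
            }
        }
        return discordant;
    }

    /** Scales the distance to [0, 1]: 0 for identical rankings, 1 for reversed ones. */
    static double normalized(List<String> first, List<String> second) {
        int n = first.size();
        if (n < 2) {
            return 0.0;
        }
        return distance(first, second) / (n * (n - 1) / 2.0);
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("u1", "u2", "u3");
        List<String> b = Arrays.asList("u1", "u3", "u2");
        System.out.println(distance(a, b)); // one swapped pair: prints 1
    }
}
```

The quadratic pair loop is fine for the top ten or twenty results of a query; an O(n log n) merge-sort variant exists for longer lists.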
Re: Search quality evaluation
> In any case, it includes a system to scrape search results from other engines, based on Apple's Sherlock search-engine descriptors. These descriptors are also used by Mozilla:

Just a note: we used to have exactly the same mechanism in Carrot2. Unfortunately this format does not make a clear distinction between the title, URL and snippet parts and stays at snippet granularity, so we additionally parsed each snippet with regular expressions...

The underlying problem is the terms of use, which forbid automatic scraping of search results using these plugins... That's the main reason why we switched to public APIs, actually.

D.
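The extra regular-expression step described above (splitting a result that arrives as one block into title, URL and snippet) might look roughly like this. The input layout, an anchor tag followed by free text, is purely an assumption for the sake of the example; real engines each need their own pattern, and this is not Carrot2's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SnippetParserSketch {

    // Assumed layout: <a href="URL">TITLE</a> SNIPPET TEXT
    private static final Pattern RESULT = Pattern.compile(
            "<a\\s+href=\"([^\"]*)\"[^>]*>(.*?)</a>\\s*(.*)",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns {url, title, snippet}, or null if the block does not fit the assumed layout. */
    static String[] parse(String rawResult) {
        Matcher m = RESULT.matcher(rawResult.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2).trim(), m.group(3).trim()};
    }

    public static void main(String[] args) {
        String[] parts = parse("<a href=\"http://example.org/\">Example</a> An example page.");
        System.out.println(parts[0] + " | " + parts[1] + " | " + parts[2]);
    }
}
```

Patterns like this break whenever an engine changes its markup, which is one more practical argument for the public APIs mentioned above.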
Re: Add .settings to svn:ignore on root Nutch folder?
> One can presumably disable such minor warnings in Eclipse. Arguably the bug is that Eclipse warns about such things by default, rather than in a 'pedantic' mode.

I agree -- some of them are really annoying. Plus, Eclipse has notoriously had problems showing warnings for unused parameters in overridden methods... But I still think some of the warnings can be valuable and your idea with PMD is a good one.

> One caution: we have run into problems where includes were removed because a tool said they were unused, but they were required for the Javadoc. So code-analysis tools are not infallible!

Eclipse deals with these properly -- I use it all the time. I believe it also shows warnings for classes referenced in Javadoc and not imported.

> I would not be opposed to integrating PMD or something similar into Nutch's build.xml. What do others think? Any volunteers?

I'll do it. I meant to look at PMD anyway, so it'll be a good exercise.

D.
Re: Add .settings to svn:ignore on root Nutch folder?
> PMD looks like a useful such tool:
> http://pmd.sourceforge.net/ant-task.html
> I would not be opposed to integrating PMD or something similar into Nutch's build.xml. What do others think? Any volunteers?

+1 (Very configurable, very good tool!)
Re: Add .settings to svn:ignore on root Nutch folder?
I'm a fan of automated testing and code analysis utilities, but I must say they only make sense if people actually use them and look at their results. So it's not really just about integration -- it's about looking at the results of these tools. PMD is neat because it can simply interrupt your build process, so you'll have to either fix the warning or explicitly mark it as ignored. With code coverage... I don't know. It's up to you guys -- you spend much more time on Nutch code than I do and you know best what is needed and what isn't.

Let me know about PMD. I'll create the patch tomorrow if there's a consensus on if and how we should use it. For those impatient, the patch is in the attachment. Place the required PMD JARs in lib/pmd-ext/ and run 'ant pmd'.

D.

Jérôme Charron wrote:
> > > I would not be opposed to integrating PMD or something similar into Nutch's build.xml. What do others think? Any volunteers?
> >
> > I'll do it. I meant to see PMD anyway so it'll be a good exercise.
>
> Dawid, what about integrating a Code Coverage Tool like EMMA (http://emma.sourceforge.net/) while integrating PMD?
>
> Jérôme

Index: build.xml
===================================================================
--- build.xml (revision 391739)
+++ build.xml (working copy)
@@ -198,6 +198,34 @@
   </target>
 
   <!-- ================================================================== -->
+  <!-- Run code checks (PMD)                                              -->
+  <!-- ================================================================== -->
+  <target name="pmd">
+    <property name="pmd.report" location="${build.dir}/pmd-report.html" />
+    <taskdef name="pmd" classname="net.sourceforge.pmd.ant.PMDTask">
+      <classpath>
+        <fileset dir="${lib.dir}" includes="pmd-ext/*.jar" />
+      </classpath>
+    </taskdef>
+    <pmd shortFilenames="true" failonerror="true" failOnRuleViolation="false"
+         encoding="${build.encoding}" failuresPropertyName="pmd.failures">
+      <ruleset>unusedcode</ruleset>
+      <formatter type="html" toFile="${pmd.report}" />
+      <!-- <formatter type="xml" toFile="${tempbuild}/$report_pmd.xml"/> -->
+      <fileset dir="${src.dir}">
+        <include name="**/*.java"/>
+        <!-- Exclude generated sources -->
+        <exclude name="**/NutchAnalysis.java" />
+        <exclude name="**/NutchAnalysisTokenManager.java" />
+      </fileset>
+    </pmd>
+    <condition property="pmd.stop" value="true">
+      <equals arg1="0" arg2="${pmd.failures}" trim="true" />
+    </condition>
+    <fail unless="pmd.stop">FAILURE: PMD shows ${pmd.failures} rule violations. See ${pmd.report} for details.</fail>
+  </target>
+
+  <!-- ================================================================== -->
   <!-- Run unit tests                                                     -->
   <!-- ================================================================== -->
   <target name="test" depends="test-core, test-plugins"/>
Re: Add .settings to svn:ignore on root Nutch folder?
Other options (raised on the Hadoop list) are Checkstyle:

http://checkstyle.sourceforge.net/

and FindBugs:

http://findbugs.sourceforge.net/

Although these are both under LGPL and thus harder to include in Apache projects.

Anything that generates a lot of false positives is bad: it either causes us to skip analysis of lots of files, or ignore the warnings. Skipping the JavaCC-generated classes is reasonable, but I'm wary of skipping much else.

Sigh.

Doug

Dawid Weiss wrote:
> Ok, PMD seems like a good idea. I've added it to the build file. Unused code detection shows a few catches (javacc-generated classes need to be ignored because they contain a lot of junk), but unfortunately it also displays false positives such as in:
>
> MapWritable.java 429 {Avoid unused private fields such as 'fKeyClassId'}
>
> This field is private but is used in an outside class (through a synthetic accessor I presume, so the simple syntax tree analysis PMD does is insufficient to catch it). These things would need to be marked in the code as ignorable...
>
> Do you want me to create a JIRA issue for this, Doug? Or should we drop the subject?
>
> Oh, I forgot to say this: PMD's jars add a minimum of 1MB to the codebase (Xerces can be reused).
>
> D.
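The kind of false positive Dawid describes can be reproduced in a few lines. In the sketch below a private field of a nested class is read only by the enclosing class, so a check that looks at the nested class in isolation could report it as unused. The class and field names are invented for this example (only loosely modelled on the MapWritable report), and the `// NOPMD` marker is shown as one way to mark such a line as ignorable, assuming the PMD version in use honours it.

```java
final class FalsePositiveSketch {

    /** Nested holder whose private field is never read inside the nested class itself. */
    private static final class Entry {
        private final int keyClassId; // NOPMD - read by the enclosing class below

        Entry(int keyClassId) {
            this.keyClassId = keyClassId;
        }
    }

    private final Entry entry = new Entry(42);

    /** The enclosing class may legally read the nested class's private field. */
    int keyClassId() {
        return entry.keyClassId;
    }

    public static void main(String[] args) {
        System.out.println(new FalsePositiveSketch().keyClassId()); // prints 42
    }
}
```

Marking individual lines keeps the rest of the file under analysis, which fits Doug's concern about not skipping whole files.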
Patch to remove Nutch formatting from logs
Hello,

Here is a patch to change org.apache.nutch.util.LogFormatter to not insert itself as the default handler for the system.

I have been using Nutch for a year and have been waiting for a version that I can embed into OpenEdit. The problem has been that Nutch inserts itself as the formatter for the Java log system, and that interferes with OpenEdit logging.

--
513-542-3401
[EMAIL PROTECTED]
http://www.openedit.org

diff -Naur ../java/org/apache/nutch/util/LogFormatter.java java/org/apache/nutch/util/LogFormatter.java
--- ../java/org/apache/nutch/util/LogFormatter.java 2006-03-31 13:40:50.0 -0500
+++ java/org/apache/nutch/util/LogFormatter.java 2006-04-05 16:27:59.0 -0400
@@ -16,13 +16,23 @@
 
 package org.apache.nutch.util;
 
-import java.util.logging.*;
-import java.io.*;
-import java.text.*;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.io.PrintStream;
+import java.io.PrintWriter;
+import java.io.StringWriter;
+import java.text.FieldPosition;
+import java.text.SimpleDateFormat;
 import java.util.Date;
-
-/** Prints just the date and the log message. */
-
+import java.util.logging.Formatter;
+import java.util.logging.Level;
+import java.util.logging.LogRecord;
+import java.util.logging.Logger;
+
+/** Prints just the date and the log message.
+ * This was also used to stop processing as nutch crawls a web site
+ * [EMAIL PROTECTED] changed this code to use a LogWrapper class to catch severe errors
+ * */
 public class LogFormatter extends Formatter {
   private static final String FORMAT = "yyMMdd HHmmss";
   private static final String NEWLINE = System.getProperty("line.separator");
@@ -35,20 +45,27 @@
   private static boolean showTime = true;
   private static boolean showThreadIDs = false;
 
+  protected static LogFormatter sharedformatter = new LogFormatter();
+  protected static SevereLogHandler sharedhandler = new SevereLogHandler(sharedformatter);
+
+  /*
   // install when this class is loaded
   static {
     Handler[] handlers = LogFormatter.getLogger("").getHandlers();
     for (int i = 0; i < handlers.length; i++) {
-      handlers[i].setFormatter(new LogFormatter());
+      handlers[i].setFormatter(sharedformatter);
       handlers[i].setLevel(Level.FINEST);
     }
   }
-
+  */
   /** Gets a logger and, as a side effect, installs this as the default
    * formatter. */
   public static Logger getLogger(String name) {
     // just referencing this class installs it
-    return Logger.getLogger(name);
+    Logger logr = Logger.getLogger(name);
+    logr.addHandler(sharedhandler);
+
+    return logr;
   }
 
   /** When true, time is logged with each entry. */
@@ -60,7 +77,10 @@
   public static void setShowThreadIDs(boolean showThreadIDs) {
     LogFormatter.showThreadIDs = showThreadIDs;
   }
-
+  public void setLoggedSevere( boolean inSevere )
+  {
+    loggedSevere = inSevere;
+  }
   /**
    * Format the given LogRecord.
    * @param record the log record to be formatted.
diff -Naur ../java/org/apache/nutch/util/SevereLogHandler.java java/org/apache/nutch/util/SevereLogHandler.java
--- ../java/org/apache/nutch/util/SevereLogHandler.java 1969-12-31 19:00:00.0 -0500
+++ java/org/apache/nutch/util/SevereLogHandler.java 2006-04-05 16:29:20.0 -0400
@@ -0,0 +1,46 @@
+/*
+ * Created on Apr 5, 2006
+ */
+package org.apache.nutch.util;
+
+import java.util.logging.Handler;
+import java.util.logging.Level;
+import java.util.logging.LogRecord;
+
+public class SevereLogHandler extends Handler
+{
+    protected LogFormatter fieldNutchFormatter;
+
+    public SevereLogHandler(LogFormatter inFormatter)
+    {
+        setNutchFormatter(inFormatter);
+    }
+
+    protected LogFormatter getNutchFormatter()
+    {
+        return fieldNutchFormatter;
+    }
+
+    protected void setNutchFormatter(LogFormatter inNutchFormatter)
+    {
+        fieldNutchFormatter = inNutchFormatter;
+    }
+
+    public void publish(LogRecord inRecord)
+    {
+        if ( inRecord.getLevel().intValue() == Level.SEVERE.intValue())
+        {
+            getNutchFormatter().setLoggedSevere(true);
+        }
+    }
+
+    public void flush()
+    {
+    }
+
+    public void close() throws SecurityException
+    {
+    }
+
+
+}
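The idea behind the patch, noticing SEVERE records through an extra handler instead of replacing the formatter on the system's existing handlers, can be tried in isolation with plain java.util.logging. The sketch below is not the patch itself: `SevereFlagHandler` and `logAndCheck` are illustrative names, the flag lives in the handler rather than in LogFormatter, and the demo switches off parent handlers only to keep its own output quiet.

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class SevereFlagDemo {

    /** Remembers whether a SEVERE record has passed through; formats and prints nothing. */
    static final class SevereFlagHandler extends Handler {
        private volatile boolean sawSevere;

        @Override
        public void publish(LogRecord record) {
            if (record.getLevel().intValue() == Level.SEVERE.intValue()) {
                sawSevere = true;
            }
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }

        boolean sawSevere() {
            return sawSevere;
        }
    }

    /** Logs one message at the given level and reports whether the handler flagged it. */
    static boolean logAndCheck(Level level) {
        Logger logger = Logger.getLogger("severe.flag.demo." + level.getName());
        logger.setUseParentHandlers(false); // demo only: avoid printing to the console
        SevereFlagHandler handler = new SevereFlagHandler();
        logger.addHandler(handler);
        try {
            logger.log(level, "demo message");
            return handler.sawSevere();
        } finally {
            logger.removeHandler(handler);
        }
    }

    public static void main(String[] args) {
        System.out.println(logAndCheck(Level.WARNING)); // prints false
        System.out.println(logAndCheck(Level.SEVERE));  // prints true
    }
}
```

Because the handler is added to a named logger and never touches the root handlers' formatters, an embedding application's own log format stays as it was.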