Thanks a lot! Nice explanation. I finally get it :-) I mean how things load, and why! Thank you! I do have one question, though it's on a slightly different topic. If you have time, a lengthy answer is always welcome :-)

Question: When content is parsed by, let's say, parse-html, or in order to parse metadata, for example by src/java/org/apache/nutch/parseHTMLMetaTags.java: when I run the ParserChecker.java main method, I don't see the extracted data parsed the way it shows in parseHTMLMetaTags. I see only content, outlink, title, etc., but no meta tag. How does that happen? I am trying my best to read the code, but I can't get beyond parse. I started at crawl :-)

I don't want to hijack the thread; I just thought you answered the question so clearly.

Regards

On 3/2/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> Here goes with the short answer. :)
>
> Configuration has two levels, default and final. It is supplied by the
> org.apache.hadoop.conf.Configuration class and extended in Nutch by the
> org.apache.nutch.util.NutchConfiguration class.
>
> Although it is configurable, by default hadoop-default.xml and
> nutch-default.xml are default resources and hadoop-site.xml and
> nutch-site.xml are final resources. Resources (i.e. resource files) can
> be added by filename to either the default or final resource set and in
> fact this is how Nutch extends the Configuration class, by adding
> nutch-default.xml and nutch-site.xml.
>
> Final resource values overwrite default resource values and final
> resource values added later will overwrite final resource values added
> earlier. When I say values I am talking about the individual properties
> not the resource files. Resource files are found by name in the
> classpath with the HADOOP_CONF_DIR or NUTCH_CONF_DIR being configured in
> the nutch and hadoop scripts as the first setting in the classpath. You
> can change the conf dir to pull configuration files from different
> directories and many tools in nutch and hadoop now provide a -conf
> option on the command line to set the conf directory.
>
> So for your example if you define the property in hadoop-default.xml or
> nutch-default.xml and it is not defined in either hadoop-site.xml or
> nutch-site.xml then the property will stand. If you define the property
> in either nutch-site.xml or hadoop-site.xml then it will override
> nutch-default.xml and hadoop-default.xml settings. And if you define it
> in both hadoop-site.xml and nutch-site.xml then the nutch-site.xml will
> override the hadoop-site.xml settings because nutch-site.xml is added
> after hadoop-site.xml. And remember only individual properties are
> overridden not the entire file.
>
> Practically you should define properties having to do with Hadoop (i.e.
> the DFS, Mapreduce, etc) in the hadoop-site.xml and properties having to
> do with Nutch (i.e. fetcher, url-normalizers, etc) in the nutch-site.xml.
>
> Dennis Kubes
>
> Ricardo J. Méndez wrote:
> > Hi Gal,
> >
> > Thanks for the reply.
> >
> > What has me wondering is that several other plugins _are_ being loaded
> > when I define it on hadoop-site.xml, and actually that defining
> > plugin.folders on that file is the only way I've found so far of getting
> > plugins loaded at all when testing from Eclipse.
> >
> > Moreover, I get this problem even if I define it in both nutch-site and
> > hadoop-site, which would make it seem that the definition in
> > hadoop-site.xml does have an effect. I was assuming they overrode the
> > options from nutch-site.xml - am I mistaken?
> >
> >
> > Ricardo J. Méndez
> > http://ricardo.strangevistas.net/
> >
> > Gal Nitzan wrote:
> >> Hi,
> >>
> >> Nutch loads its configuration from nutch-site and nutch-default.xml and not
> >> from hadoop conf files so the behavior is correct.
> >>
> >> HTH,
> >>
> >> Gal.
> >>
> >>
> >> On 3/1/07, "Ricardo J. Méndez" <[EMAIL PROTECTED]> wrote:
> >>> Hi,
> >>>
> >>> I'm using nutch-0.9, from the trunk.
> >>> I've noticed a behavior difference on a plugin unit test if I set the
> >>> plugin.folders property on nutch-site.xml vs. hadoop-site.xml. If I set
> >>> it on nutch-site.xml, the unit test works well, but an error is raised
> >>> if it's on hadoop-site.xml
> >>>
> >>> The error is:
> >>>
> >>> [junit] WARN [main] (ParserFactory.java:196) - Canno initialize
> >>> parser parse-html (cause:
> >>> org.apache.nutch.plugin.PluginRuntimeException:
> >>> java.lang.ClassNotFoundException: org.apache.nutch.parse.html.HtmlParser
> >>>
> >>>
> >>> Is there a reason why the HtmlParser wouldn't be loaded when the
> >>> directory is specified on hadoop-site.xml?
> >>>
> >>> Thanks in advance,
> >>>
> >>>
> >>> Ricardo J. Méndez
> >>> http://ricardo.strangevistas.net/
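
The override order described in Dennis's quoted message can be sketched as a toy model in plain Java. This is only an illustration under stated assumptions, not the Hadoop Configuration class: LayeredConf, addDefault, addFinal and get are hypothetical names, each resource file is stood in for by a one-entry map, and the property values are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only. LayeredConf and its methods are hypothetical names,
// not part of Hadoop or Nutch.
public class LayeredConf {
    // Each map stands in for one resource file, kept in the order it was added.
    private final List<Map<String, String>> defaults = new ArrayList<>();
    private final List<Map<String, String>> finals = new ArrayList<>();

    public void addDefault(Map<String, String> resource) {
        defaults.add(resource);
    }

    public void addFinal(Map<String, String> resource) {
        finals.add(resource);
    }

    // Merge property by property: defaults first, then finals, so that
    // finals override defaults and later resources override earlier ones.
    public String get(String name) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> resource : defaults) {
            merged.putAll(resource);
        }
        for (Map<String, String> resource : finals) {
            merged.putAll(resource);
        }
        return merged.get(name);
    }

    public static void main(String[] args) {
        LayeredConf conf = new LayeredConf();
        // Illustrative values; the comments name the file each map stands in for.
        conf.addDefault(Collections.singletonMap("plugin.folders", "from-hadoop-default")); // hadoop-default.xml
        conf.addDefault(Collections.singletonMap("fetcher.threads.fetch", "10"));           // nutch-default.xml
        conf.addFinal(Collections.singletonMap("plugin.folders", "from-hadoop-site"));      // hadoop-site.xml
        conf.addFinal(Collections.singletonMap("plugin.folders", "from-nutch-site"));       // nutch-site.xml, added last

        // The final resource added last wins for plugin.folders.
        System.out.println(conf.get("plugin.folders"));        // prints "from-nutch-site"
        // Not set in any final resource, so the default stands.
        System.out.println(conf.get("fetcher.threads.fetch")); // prints "10"
    }
}
```

Only the individual property is overridden: fetcher.threads.fetch keeps its default value even though the later resources redefine plugin.folders.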
