Re: [Nutch-general] Behavior of nutch-site.xml vs. hadoop-site.xml

rubdabadub Fri, 02 Mar 2007 08:00:41 -0800

Super thanks! Nice explanation. I finally get it :-) I mean how things
load and why! Thank you! I do have one question, though it's a
bit different. If you do have time, a lengthy answer is always
welcome :-)


Question:
When the content is parsed by, let's say, parse-html, or in order to
parse meta data, for example:
src/java/org/apache/nutch/parse/HTMLMetaTags.java

Now when I run the ParserChecker.java main method, I don't see the
extracted data the way it appears in HTMLMetaTags. I see
only content, outlinks, title etc., but no meta tags. How does that happen?
I am trying my best to read the code but I can't get beyond parse.
I started at crawl :-)

I don't want to hijack the thread; I just thought you answered the
question so clearly.
Regards

On 3/2/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> Here goes with the short answer. :)
>
> Configuration has two levels, default and final.  It is supplied by the
> org.apache.hadoop.conf.Configuration class and extended in Nutch by the
> org.apache.nutch.util.NutchConfiguration class.
>
> Although it is configurable, by default hadoop-default.xml and
> nutch-default.xml are default resources and hadoop-site.xml and
> nutch-site.xml are final resources.  Resources (i.e. resource files) can
> be added by filename to either the default or final resource set and in
> fact this is how Nutch extends the Configuration class, by adding
> nutch-default.xml and nutch-site.xml.
>
> Final resource values overwrite default resource values and final
> resource values added later will overwrite final resource values added
> earlier.  When I say values I am talking about the individual properties
> not the resource files.  Resource files are found by name in the
> classpath with the HADOOP_CONF_DIR or NUTCH_CONF_DIR being configured in
> the nutch and hadoop scripts as the first setting in the classpath.  You
> can change the conf dir to pull configuration files from different
> directories and many tools in nutch and hadoop now provide a -conf
> option on the command line to set the conf directory.
>
> So for your example if you define the property in hadoop-default.xml or
> nutch-default.xml and it is not defined in either hadoop-site.xml or
> nutch-site.xml then the property will stand.  If you define the property
> in either nutch-site.xml or hadoop-site.xml then it will override
> nutch-default.xml and hadoop-default.xml settings.  And if you define it
> in both hadoop-site.xml and nutch-site.xml then the nutch-site.xml will
> override the hadoop-site.xml settings because nutch-site.xml is added
> after hadoop-site.xml.  And remember only individual properties are
> overridden not the entire file.
>
> Practically you should define properties having to do with Hadoop (e.g.
> the DFS, MapReduce, etc.) in the hadoop-site.xml and properties having to
> do with Nutch (e.g. fetcher, url-normalizers, etc.) in the nutch-site.xml.
>
> Dennis Kubes
>
> Ricardo J. Méndez wrote:
> > Hi Gal,
> >
> > Thanks for the reply.
> >
> > What has me wondering is that several other plugins _are_ being loaded
> > when I define it on hadoop-site.xml, and actually that defining
> > plugin.folders on that file is the only way I've found so far of getting
> > plugins loaded at all when testing from Eclipse.
> >
> > Moreover, I get this problem even if I define it in both nutch-site and
> > hadoop-site, which would make it seem that the definition in
> > hadoop-site.xml does have an effect.  I was assuming they overrode the
> > options from nutch-site.xml - am I mistaken?
> >
> >
> > Ricardo J. Méndez
> > http://ricardo.strangevistas.net/
> >
> > Gal Nitzan wrote:
> >> Hi,
> >>
> >> Nutch loads its configuration from nutch-site and nutch-default.xml and not
> >> from hadoop conf files so the behavior is correct.
> >>
> >> HTH,
> >>
> >> Gal.
> >>
> >>
> >> On 3/1/07, "Ricardo J. Méndez" <[EMAIL PROTECTED]> wrote:
> >>> Hi,
> >>>
> >>> I'm using nutch-0.9, from the trunk.  I've noticed a behavior
> >>> difference on a plugin unit test if I set the plugin.folders property on
> >>> nutch-site.xml vs. hadoop-site.xml.  If I set it on nutch-site.xml, the
> >>> unit test works well, but an error is raised if it's on hadoop-site.xml
> >>>
> >>> The error is:
> >>>
> >>>    [junit]  WARN [main] (ParserFactory.java:196) - Canno initialize
> >>> parser parse-html (cause:
> >>> org.apache.nutch.plugin.PluginRuntimeException:
> >>> java.lang.ClassNotFoundException: org.apache.nutch.parse.html.HtmlParser
> >>>
> >>>
> >>> Is there a reason why the HtmlParser wouldn't be loaded when the
> >>> directory is specified on hadoop-site.xml?
> >>>
> >>> Thanks in advance,
> >>>
> >>>
> >>>
> >>>
> >>> Ricardo J. Méndez
> >>> http://ricardo.strangevistas.net/
> >>>
>
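
For anyone following along, the override order Dennis describes in the
quoted reply can be modeled with a small, self-contained Java sketch. This
is only a toy model built on plain maps, not the real
org.apache.hadoop.conf.Configuration class. The class name
LayeredConfigSketch, the resource() helper and the property values are all
made up for illustration, and applying default resources in the order they
were added is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: this is NOT org.apache.hadoop.conf.Configuration.
class LayeredConfigSketch {
    private final List<Map<String, String>> defaultResources = new ArrayList<>();
    private final List<Map<String, String>> finalResources = new ArrayList<>();

    void addDefaultResource(Map<String, String> properties) {
        defaultResources.add(properties);
    }

    void addFinalResource(Map<String, String> properties) {
        finalResources.add(properties);
    }

    // Walk the default resources first, then the final ones, each in the
    // order added. The last resource that defines a property wins, so
    // overriding happens per property, never per file.
    String get(String name) {
        String value = null;
        for (Map<String, String> resource : defaultResources) {
            if (resource.containsKey(name)) {
                value = resource.get(name);
            }
        }
        for (Map<String, String> resource : finalResources) {
            if (resource.containsKey(name)) {
                value = resource.get(name);
            }
        }
        return value;
    }

    // Hypothetical helper: builds one "resource file" from key/value pairs.
    static Map<String, String> resource(String... keyValuePairs) {
        Map<String, String> properties = new HashMap<>();
        for (int i = 0; i + 1 < keyValuePairs.length; i += 2) {
            properties.put(keyValuePairs[i], keyValuePairs[i + 1]);
        }
        return properties;
    }

    // Mirrors the four files from the thread; all values are illustrative.
    static LayeredConfigSketch example() {
        LayeredConfigSketch conf = new LayeredConfigSketch();
        conf.addDefaultResource(resource());                      // hadoop-default.xml
        conf.addDefaultResource(resource(                         // nutch-default.xml
                "plugin.folders", "plugins",
                "fetcher.threads.fetch", "10"));
        conf.addFinalResource(resource(                           // hadoop-site.xml
                "plugin.folders", "/from/hadoop-site"));
        conf.addFinalResource(resource(                           // nutch-site.xml
                "plugin.folders", "/from/nutch-site"));
        return conf;
    }

    public static void main(String[] args) {
        LayeredConfigSketch conf = example();
        System.out.println(conf.get("plugin.folders"));
        System.out.println(conf.get("fetcher.threads.fetch"));
    }
}
```

In this model plugin.folders resolves to the nutch-site value because that
resource was added last, while fetcher.threads.fetch keeps its default
because no site file redefines it.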

