Believe it or not, I don't think that meta tags are currently stored.  I 
looked through the HTML parsing code and didn't see anywhere that 
could be storing them except in HTML filters.  I see that meta tags are 
parsed and passed to the HTML filters, but I didn't see any default 
filter that stores them.

If we really aren't storing meta tags (I could be missing where this is 
happening :) ), there is no reason why we shouldn't be storing them, and 
this is something that people want, then I can create an HTML filter 
that will store the meta tags in the Parse MetaData.
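
To make the idea concrete, here is a rough sketch of what such a filter 
would do.  It is not written against the real Nutch filter interface: the 
class name, the "meta." key prefix, and the plain maps standing in for the 
parsed meta tags and the Parse MetaData are all assumptions for 
illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Sketch only: stand-in types, NOT the real Nutch html filter API.
class MetaTagStoreSketch {

    // Hypothetical key prefix so stored tags do not collide with other metadata.
    static final String PREFIX = "meta.";

    // Copies every parsed meta tag into the metadata map.
    // metaTags stands in for what the parser hands to a filter;
    // parseMeta stands in for the Parse MetaData.
    static Map<String, String> storeMetaTags(Map<String, String> metaTags,
                                             Map<String, String> parseMeta) {
        for (Map.Entry<String, String> tag : metaTags.entrySet()) {
            // Lower-case the name so "Keywords" and "keywords" share one key.
            String key = PREFIX + tag.getKey().toLowerCase(Locale.ROOT);
            parseMeta.put(key, tag.getValue());
        }
        return parseMeta;
    }

    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("Description", "A test page");
        tags.put("keywords", "nutch, crawl");
        Map<String, String> meta = storeMetaTags(tags, new LinkedHashMap<>());
        System.out.println(meta);
    }
}
```

A real filter would do the same copy inside whatever method the html 
filter extension point defines, using the meta tags object the parser 
already passes in.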

Dennis Kubes

rubdabadub wrote:
> Super, thanks! Nice explanation. I finally got it :-) I mean how things
> load and why! Thank you! I do have one question, though it's a
> bit different. If you do have time, a lengthy answer is always
> welcome :-)
> 
> Question:
> When the content is parsed by, let's say, parse-html, or in order to
> parse meta data, for example:
> src/java/org/apache/nutch/parseHTMLMetaTags.java
> 
> Now when I run the ParserChecker.java main method, I don't see the
> extracted data parsed the way it shows in parseHTMLMetaTags. I see
> only content, outlink, title etc., but no meta tags. How does that
> happen? I am trying my best to read the code but I can't get beyond
> parse. I started at crawl :-)

> 
> I don't want to hijack the thread; I just thought you answered the
> question so clearly.
> Regards
> 
> On 3/2/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>> Here goes with the short answer. :)
>>
>> Configuration has two levels, default and final.  It is supplied by the
>> org.apache.hadoop.conf.Configuration class and extended in Nutch by the
>>   org.apache.nutch.util.NutchConfiguration class.
>>
>> Although it is configurable, by default hadoop-default.xml and
>> nutch-default.xml are default resources and hadoop-site.xml and
>> nutch-site.xml are final resources.  Resources (i.e. resource files) can
>> be added by filename to either the default or final resource set and in
>> fact this is how Nutch extends the Configuration class, by adding
>> nutch-default.xml and nutch-site.xml.
>>
>> Final resource values overwrite default resource values and final
>> resource values added later will overwrite final resource values added
>> earlier.  When I say values I am talking about the individual properties
>> not the resource files.  Resource files are found by name in the
>> classpath with the HADOOP_CONF_DIR or NUTCH_CONF_DIR being configured in
>> the nutch and hadoop scripts as the first setting in the classpath.  You
>> can change the conf dir to pull configuration files from different
>> directories and many tools in nutch and hadoop now provide a -conf
>> option on the command line to set the conf directory.
>>
>> So for your example if you define the property in hadoop-default.xml or
>> nutch-default.xml and it is not defined in either hadoop-site.xml or
>> nutch-site.xml then the property will stand.  If you define the property
>> in either nutch-site.xml or hadoop-site.xml then it will override
>> nutch-default.xml and hadoop-default.xml settings.  And if you define it
>> in both hadoop-site.xml and nutch-site.xml then the nutch-site.xml will
>> override the hadoop-site.xml settings because nutch-site.xml is added
>> after hadoop-site.xml.  And remember only individual properties are
>> overridden not the entire file.
>>
>> Practically you should define properties having to do with Hadoop (i.e.
>> the DFS, Mapreduce, etc) in the hadoop-site.xml and properties having to
>> do with Nutch (i.e. fetcher, url-normalizers, etc) in the nutch-site.xml.
>>
>> Dennis Kubes
>>
>> Ricardo J. Méndez wrote:
>> > Hi Gal,
>> >
>> > Thanks for the reply.
>> >
>> > What has me wondering is that several other plugins _are_ being loaded
>> > when I define it on hadoop-site.xml, and actually that defining
>> > plugin.folders on that file is the only way I've found so far of getting
>> > plugins loaded at all when testing from Eclipse.
>> >
>> > Moreover, I get this problem even if I define it in both nutch-site and
>> > hadoop-site, which would make it seem that the definition in
>> > hadoop-site.xml does have an effect.  I was assuming they overrode the
>> > options from nutch-site.xml - am I mistaken?
>> >
>> >
>> > Ricardo J. Méndez
>> > http://ricardo.strangevistas.net/
>> >
>> > Gal Nitzan wrote:
>> >> Hi,
>> >>
>> >> Nutch loads its configuration from nutch-site and nutch-default.xml and not
>> >> from hadoop conf files so the behavior is correct.
>> >>
>> >> HTH,
>> >>
>> >> Gal.
>> >>
>> >>
>> >> On 3/1/07, "Ricardo J. Méndez" <[EMAIL PROTECTED]> wrote:
>> >>> Hi,
>> >>>
>> >>> I'm using nutch-0.9, from the trunk.    I've noticed a behavior
>> >>> difference on a plugin unit test if I set the plugin.folders property on
>> >>> nutch-site.xml vs. hadoop-site.xml.  If I set it on nutch-site.xml, the
>> >>> unit test works well, but an error is raised if it's on hadoop-site.xml
>> >>>
>> >>> The error is:
>> >>>
>> >>>    [junit]  WARN [main] (ParserFactory.java:196) - Canno initialize
>> >>> parser parse-html (cause:
>> >>> org.apache.nutch.plugin.PluginRuntimeException:
>> >>> java.lang.ClassNotFoundException: org.apache.nutch.parse.html.HtmlParser
>> >>>
>> >>>
>> >>> Is there a reason why the HtmlParser wouldn't be loaded when the
>> >>> directory is specified on hadoop-site.xml?
>> >>>
>> >>> Thanks in advance,
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> Ricardo J. Méndez
>> >>> http://ricardo.strangevistas.net/
>> >>>
>>
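
As a footnote to the quoted configuration explanation above, the override 
order can be modelled in a few lines.  This is a toy model with plain 
maps, not the Hadoop Configuration class; the class name, the method 
names, and the property values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: each map stands in for one resource file.
class LayeredConfSketch {
    private final List<Map<String, String>> defaults = new ArrayList<>();
    private final List<Map<String, String>> finals = new ArrayList<>();

    void addDefault(Map<String, String> resource) { defaults.add(resource); }
    void addFinal(Map<String, String> resource) { finals.add(resource); }

    // Looks up one property: defaults are applied first, then finals in the
    // order they were added, so a later final wins over an earlier one.
    String get(String name) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> r : defaults) merged.putAll(r);
        for (Map<String, String> r : finals) merged.putAll(r);
        return merged.get(name);
    }

    public static void main(String[] args) {
        LayeredConfSketch conf = new LayeredConfSketch();
        // Stand-in for nutch-default.xml.
        conf.addDefault(Map.of("plugin.folders", "from-nutch-default"));
        // Stand-in for hadoop-site.xml, added before nutch-site.xml.
        conf.addFinal(Map.of("plugin.folders", "from-hadoop-site",
                             "example.prop", "hadoop-only"));
        // Stand-in for nutch-site.xml, added last.
        conf.addFinal(Map.of("plugin.folders", "from-nutch-site"));

        System.out.println(conf.get("plugin.folders")); // from-nutch-site
        System.out.println(conf.get("example.prop"));   // hadoop-only
    }
}
```

The second lookup shows the per-property point: a value that only the 
earlier final resource defines survives, because only individual 
properties are overridden, not whole files.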

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general