Believe it or not, I don't think that meta tags are currently stored. I looked through the HTML parsing code and didn't see anywhere that could be storing them except in the HTML filters. Meta tags are parsed and passed to the HTML filters, but I didn't see any default filter that stores them.
If there isn't a reason not to store meta tags, if we aren't already storing them (I could be missing where this is happening :) ), and if this is something that people want, then I can create an HTML filter that stores the meta tags in the Parse MetaData.

Dennis Kubes

rubdabadub wrote:
> Super thanks! Nice explanation. I finally got it :-) I mean how things
> load and why! Thank you! I do have one question though, however it's a
> bit different. But if you do have time a lengthy answer is always
> welcome :-)
>
> Question:
> When the content is parsed by -- let's say parse-html or in order to
> parse meta data for example..
> src/java/org/apache/nutch/parseHTMLMetaTags.java
>
> Now when I run the ParserChecker.java main method.. I don't see the
> extracted data parsed the way it shows in parseHTMLMetaTags.. I see
> only content, outlink, title etc.. no meta tag.. How does that happen ..
> cos I am trying my best to read the code but I can't go beyond parse..
> I started at crawl :-) After looking through it
>
> I don't want to hijack the thread, I just thought you answered the
> question so clearly..
> Regards
>
> On 3/2/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>> Here goes with the short answer. :)
>>
>> Configuration has two levels, default and final. It is supplied by the
>> org.apache.hadoop.conf.Configuration class and extended in Nutch by the
>> org.apache.nutch.util.NutchConfiguration class.
>>
>> Although it is configurable, by default hadoop-default.xml and
>> nutch-default.xml are default resources and hadoop-site.xml and
>> nutch-site.xml are final resources. Resources (i.e. resource files) can
>> be added by filename to either the default or final resource set and in
>> fact this is how Nutch extends the Configuration class, by adding
>> nutch-default.xml and nutch-site.xml.
>>
>> Final resource values overwrite default resource values and final
>> resource values added later will overwrite final resource values added
>> earlier. When I say values I am talking about the individual properties
>> not the resource files. Resource files are found by name in the
>> classpath with the HADOOP_CONF_DIR or NUTCH_CONF_DIR being configured in
>> the nutch and hadoop scripts as the first setting in the classpath. You
>> can change the conf dir to pull configuration files from different
>> directories and many tools in nutch and hadoop now provide a -conf
>> option on the command line to set the conf directory.
>>
>> So for your example if you define the property in hadoop-default.xml or
>> nutch-default.xml and it is not defined in either hadoop-site.xml or
>> nutch-site.xml then the property will stand. If you define the property
>> in either nutch-site.xml or hadoop-site.xml then it will override
>> nutch-default.xml and hadoop-default.xml settings. And if you define it
>> in both hadoop-site.xml and nutch-site.xml then the nutch-site.xml will
>> override the hadoop-site.xml settings because nutch-site.xml is added
>> after hadoop-site.xml. And remember only individual properties are
>> overridden not the entire file.
>>
>> Practically you should define properties having to do with Hadoop (e.g.
>> the DFS, MapReduce, etc.) in the hadoop-site.xml and properties having to
>> do with Nutch (e.g. fetcher, url-normalizers, etc.) in the nutch-site.xml.
>>
>> Dennis Kubes
>>
>> Ricardo J. Méndez wrote:
>> > Hi Gal,
>> >
>> > Thanks for the reply.
>> >
>> > What has me wondering is that several other plugins _are_ being loaded
>> > when I define it on hadoop-site.xml, and actually that defining
>> > plugin.folders on that file is the only way I've found so far of getting
>> > plugins loaded at all when testing from Eclipse.
>> >
>> > Moreover, I get this problem even if I define it in both nutch-site and
>> > hadoop-site, which would make it seem that the definition in
>> > hadoop-site.xml does have an effect. I was assuming they overrode the
>> > options from nutch-site.xml - am I mistaken?
>> >
>> >
>> > Ricardo J. Méndez
>> > http://ricardo.strangevistas.net/
>> >
>> > Gal Nitzan wrote:
>> >> Hi,
>> >>
>> >> Nutch loads its configuration from nutch-site and nutch-default.xml and not
>> >> from hadoop conf files so the behavior is correct.
>> >>
>> >> HTH,
>> >>
>> >> Gal.
>> >>
>> >>
>> >> On 3/1/07, "Ricardo J. Méndez" <[EMAIL PROTECTED]> wrote:
>> >>> Hi,
>> >>>
>> >>> I'm using nutch-0.9, from the trunk. I've noticed a behavior
>> >>> difference on a plugin unit test if I set the plugin.folders property on
>> >>> nutch-site.xml vs. hadoop-site.xml. If I set it on nutch-site.xml, the
>> >>> unit test works well, but an error is raised if it's on hadoop-site.xml
>> >>>
>> >>> The error is:
>> >>>
>> >>> [junit] WARN [main] (ParserFactory.java:196) - Canno initialize
>> >>> parser parse-html (cause:
>> >>> org.apache.nutch.plugin.PluginRuntimeException:
>> >>> java.lang.ClassNotFoundException: org.apache.nutch.parse.html.HtmlParser
>> >>>
>> >>>
>> >>> Is there a reason why the HtmlParser wouldn't be loaded when the
>> >>> directory is specified on hadoop-site.xml?
>> >>>
>> >>> Thanks in advance,
>> >>>
>> >>>
>> >>> Ricardo J. Méndez
>> >>> http://ricardo.strangevistas.net/

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
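
For anyone reading the archive later, the override order described in the quoted "short answer" above can be sketched roughly as follows. This is a toy model only, not the Hadoop Configuration class: the class name LayeredConfigSketch, its methods, and all property values are made up for illustration, and the rule that a later default resource overrides an earlier default is an assumption (the thread only states it for final resources).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy model of two-level configuration lookup (default vs. final resources).
 * NOT the real org.apache.hadoop.conf.Configuration; every name is hypothetical.
 */
public class LayeredConfigSketch {

    // Each map stands in for one parsed resource file: property name -> value.
    private final List<Map<String, String>> defaultResources = new ArrayList<>();
    private final List<Map<String, String>> finalResources = new ArrayList<>();

    public void addDefaultResource(Map<String, String> props) {
        defaultResources.add(props);
    }

    public void addFinalResource(Map<String, String> props) {
        finalResources.add(props);
    }

    /**
     * Resolves a single property. Final resources beat default resources,
     * and a final resource added later beats one added earlier.
     * Assumption: later defaults also beat earlier defaults.
     */
    public String get(String name) {
        String value = null;
        for (Map<String, String> resource : defaultResources) {
            if (resource.containsKey(name)) {
                value = resource.get(name);
            }
        }
        for (Map<String, String> resource : finalResources) {
            if (resource.containsKey(name)) {
                value = resource.get(name);
            }
        }
        return value;
    }

    // Small helper: builds a resource from alternating key/value strings.
    private static Map<String, String> resource(String... keyValues) {
        Map<String, String> props = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            props.put(keyValues[i], keyValues[i + 1]);
        }
        return props;
    }

    /** Mirrors the four-file layout from the thread, with illustrative values. */
    public static LayeredConfigSketch example() {
        LayeredConfigSketch conf = new LayeredConfigSketch();
        // Stand-in for hadoop-default.xml
        conf.addDefaultResource(resource("fs.default.name", "local", "io.sort.mb", "100"));
        // Stand-in for nutch-default.xml
        conf.addDefaultResource(resource("plugin.folders", "plugins"));
        // Stand-in for hadoop-site.xml (added before nutch-site.xml)
        conf.addFinalResource(resource("io.sort.mb", "200", "plugin.folders", "/hadoop/plugins"));
        // Stand-in for nutch-site.xml (added last, so it wins on conflicts)
        conf.addFinalResource(resource("plugin.folders", "/nutch/plugins"));
        return conf;
    }
}
```

With this layout, LayeredConfigSketch.example().get("plugin.folders") resolves to the nutch-site stand-in's value, while a property set only in a default resource keeps its default. Only the individual property is overridden, never the whole file.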
