[ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922282#comment-13922282 ]
Vangelis Karvounis commented on NUTCH-1478:
-------------------------------------------

Hi!

I have a few questions on how to run this patch:

1. In nutch-site.xml:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-domain|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
      <description> </description>
    </property>

2. In nutch-site.xml, can you tell us how to use these 4 new properties?

    <property>
      <name>index.parse.md</name>
      <value>description,keywords</value>
      <description></description>
    </property>
    <property>
      <name>index.content.md</name>
      <value></value>
      <description> </description>
    </property>
    <property>
      <name>index.db.md</name>
      <value></value>
      <description> </description>
    </property>
    <!-- parse-metatags plugin properties -->
    <property>
      <name>description;keywords</name>
      <value>*</value>
      <description> </description>
    </property>

3. I read somewhere that we need to add

    <field name="metatag.description" type="string" stored="true" indexed="true"/>

   to schema.xml in both Solr and Nutch. Is that correct?

4. I want to see my chosen metatags in MySQL, since I find it more useful for my queries. Any ideas on how to implement this?

5. I want to crawl a page for <meta og:video> or <meta twitter:image>. Any ideas?

> Parse-metatags and index-metadata plugin for Nutch 2.x series
> --------------------------------------------------------------
>
>                 Key: NUTCH-1478
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1478
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.1
>            Reporter: kiran
>             Fix For: 2.3
>
>         Attachments: NUTCH-1478-parse-v2.patch, NUTCH-1478v3.patch,
> NUTCH-1478v4.patch, NUTCH-1478v5.patch, Nutch1478.patch, Nutch1478.zip,
> metadata_parseChecker_sites.png
>
>
> I have ported parse-metatags and index-metadata plugin to Nutch 2.x series.
> This will take multiple values of the same tag and index them in Solr, as I
> patched before (https://issues.apache.org/jira/browse/NUTCH-1467).
> The usage is the same as described here
> (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is no
> need to give the 'metatag' keyword before metatag names. For example, my
> configuration looks like this
> (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml)
>
> This is only the first version and does not include the JUnit test. I will
> update the new version soon.
> This will parse the tags and index them in Solr. Make sure the fields listed
> in 'index.parse.md' in nutch-site.xml are also created in schema.xml in Solr.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.

--
This message was sent by Atlassian JIRA
(v6.2#6252)