[ 
https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13541428#comment-13541428
 ] 

kiran commented on NUTCH-1478:
------------------------------

Hi Jaap,

Parsechecker should work if the metatags.names field is configured and the 
two plugins are copied into the plugin folder. 

I ran the same command as you did, and it looks like there is no actual 
metadata in that page. I should have used the Nutch website as an example. 
Please check the attached screenshot, which shows the different websites I 
parsed and the metadata found in each. 

Once parsechecker is working, we should make sure the indexing is working. 
For that, we need to define the fields we want indexed in the 
index.parse.md property in nutch-site.xml. The way this property is defined 
differs between 1.x and 2.x. 

When I was working with this plugin, I was able to define the metatag fields 
as they are, with the same names in the schema, and it worked for me. This is 
my schema 
(https://github.com/salvager/apache-solr-4.0.0-BETA/blob/master/example/solr/ejournals/conf/schema.xml).
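
To illustrate the point about matching names, here is a minimal Python sketch 
(not part of the patch) that reports which index.parse.md fields have no 
matching <field> in a Solr schema. The property names metatags.names and 
index.parse.md come from this thread. The sample values, the schema fragment, 
the helper names, and the accepted separators (comma or semicolon) are 
assumptions made for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical nutch-site.xml fragment. Property names are from this thread;
# the values and the comma separator are assumptions.
NUTCH_SITE = """
<configuration>
  <property>
    <name>metatags.names</name>
    <value>description,keywords</value>
  </property>
  <property>
    <name>index.parse.md</name>
    <value>description,keywords</value>
  </property>
</configuration>
"""

# Hypothetical Solr schema.xml fragment; "keywords" is left out on purpose.
SOLR_SCHEMA = """
<schema name="example" version="1.5">
  <fields>
    <field name="description" type="string" indexed="true"
           stored="true" multiValued="true"/>
  </fields>
</schema>
"""


def nutch_property(conf_xml, name):
    """Return the value of a named <property>, or None if it is absent."""
    root = ET.fromstring(conf_xml.strip())
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return None


def split_fields(value):
    """Split a field list on commas or semicolons (separator is assumed)."""
    return [f.strip() for f in re.split(r"[;,]", value or "") if f.strip()]


def schema_fields(schema_xml):
    """Collect the names of all explicit <field> elements in a schema."""
    root = ET.fromstring(schema_xml.strip())
    return {f.get("name") for f in root.iter("field")}


def missing_fields(conf_xml, schema_xml):
    """List index.parse.md fields that the schema does not define."""
    wanted = split_fields(nutch_property(conf_xml, "index.parse.md"))
    defined = schema_fields(schema_xml)
    return [f for f in wanted if f not in defined]


if __name__ == "__main__":
    print(missing_fields(NUTCH_SITE, SOLR_SCHEMA))
```

The sketch only looks at explicit <field> entries, so a field covered by a 
dynamicField pattern in the schema would still be reported as missing.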
 

The dc fields that I have defined are particular to the website I am crawling. 
They might not be present on all websites. 

I hope this helps. 


                
> Parse-metatags and index-metadata plugin for Nutch 2.x series 
> --------------------------------------------------------------
>
>                 Key: NUTCH-1478
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1478
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.1
>            Reporter: kiran
>         Attachments: metadata_parseChecker_sites.png, Nutch1478.patch, 
> Nutch1478.zip
>
>
> I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x 
> series. They take multiple values of the same tag and index them in Solr, as 
> I patched before (https://issues.apache.org/jira/browse/NUTCH-1467).
> The usage is the same as described here 
> (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is 
> no need to put the 'metatag' keyword before metatag names. For example, my 
> configuration looks like this 
> (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml)
>  
> This is only the first version and does not include the JUnit test. I will 
> upload an updated version soon.
> This will parse the tags and index them in Solr. Make sure the fields listed 
> in 'index.parse.md' in nutch-site.xml are also created in schema.xml in Solr.
> Please let me know if you have any suggestions.
> This is supported by DLA (Digital Library and Archives) of Virginia Tech.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
