[ 
https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398296#comment-13398296
 ] 

Kristof  commented on NUTCH-1406:
---------------------------------

Markus, I will provide the patch against trunk. However, since I worked from the 
metatags-plugin+tutorial.zip provided under #NUTCH-809, I first need to port my 
adjustments to the trunk files. I am having some problems building the classes 
with ant and will come back to this after the weekend, once I have more 
time to look into it.
Julien, thanks for the link.
                
> Metatags-index/-parse plugin: conversion to Solr date format and prevents 
> parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof 
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metatags.jar
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as the 
> parse-metatags plugin) converts selected fields to the Solr 
> date format and prevents parsing/indexing of metatags that do not contain any 
> content.
> To convert the values of selected metatags to the Solr date format, you 
> must list them in the metatags.convert property in nutch-site.xml (the code 
> splits the value on ";"). The example below uses the simple Dublin Core 
> element dc.date. The tag must also be listed in the metatags.names property.
> {code}
> <property>
>       <name>metatags.convert</name>
>       <value>dc.date</value>
>       <description>For plugin index-metatags: Indicate here the name of the 
> html meta tag that should be converted to date format.
>       </description>
> </property>
> {code}
> I read that SimpleDateFormat is not a robust solution, so this 
> improvement might have some problems.
> So far it has worked well for me. More details about the changes follow below.
> Changes made to MetaTagsIndexer.java between lines 41 and 71:
> {code}
>       // Requires java.text.ParseException, java.text.SimpleDateFormat and java.util.Date
>       if (tagEntry != null && tagEntry.trim().length() > 0) {
>               boolean added = false;
>               if (checkDateConversion(metatag)) {
>                       try {
>                               // Parse values such as 2012-06-21 into a Date
>                               Date date = new SimpleDateFormat("yyyy-MM-dd").parse(tagEntry.trim());
>                               doc.add(metatag, date);
>                               added = true;
>                       } catch (ParseException e) {
>                               // Not a valid date: skip the field and trace the failure
>                               if (LOG.isTraceEnabled()) {
>                                       LOG.trace(url.toString() + " : date conversion failed for " + tagEntry + " in " + metatag + " field");
>                               }
>                       }
>               } else {
>                       doc.add(metatag, tagEntry);
>                       added = true;
>               }
>               // Report success only when a value was actually added
>               if (added && LOG.isTraceEnabled()) {
>                       LOG.trace(url.toString() + " : successfully added " + tagEntry + " to the " + metatag + " field");
>               }
>       } else {
>               // Empty metatags are neither parsed nor indexed
>               if (LOG.isTraceEnabled()) {
>                       LOG.trace(url.toString() + " : " + metatag + " and " + tagEntry + " not added as Metatag does not have any content");
>               }
>       }
> {code}
> Method added to MetaTagsIndexer.java:
> {code}
>       public boolean checkDateConversion(String metatag) {
>               // metatags.convert holds a semicolon-separated list of tag names
>               String convertToDate = conf.get("metatags.convert", "*");
>               for (String check : convertToDate.split(";")) {
>                       // Tolerate whitespace around the configured names
>                       if (check.trim().equals(metatag)) {
>                               return true;
>                       }
>               }
>               return false;
>       }
> {code}
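
As a possible illustration of the SimpleDateFormat concern mentioned in the description: SimpleDateFormat is lenient by default, so a value such as 2012-02-30 is rolled over to 2012-03-01 instead of being rejected. The sketch below is not part of the attached plugin. It assumes Java 8 or later (java.time) and uses hypothetical names (SolrDateSketch, toSolrDate) to show a strict parse of a yyyy-MM-dd value and its rendering in the UTC ISO-8601 form that Solr date fields expect.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper, not part of index-metatags; assumes Java 8+.
class SolrDateSketch {

    // Strictly parses a yyyy-MM-dd value and returns it as a UTC timestamp
    // string, or an empty Optional if the value is missing or invalid.
    static Optional<String> toSolrDate(String tagEntry) {
        if (tagEntry == null || tagEntry.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            // LocalDate.parse uses ISO_LOCAL_DATE, which resolves strictly
            LocalDate date = LocalDate.parse(tagEntry.trim());
            // Midnight UTC; Instant.toString() produces the trailing-Z form
            return Optional.of(date.atStartOfDay(ZoneOffset.UTC).toInstant().toString());
        } catch (DateTimeParseException e) {
            // Invalid calendar dates and other formats end up here
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(toSolrDate("2012-06-21")); // Optional[2012-06-21T00:00:00Z]
        System.out.println(toSolrDate("2012-02-30")); // Optional.empty
        System.out.println(toSolrDate("   "));        // Optional.empty
    }
}
```

Returning an empty Optional lets the caller skip the field, which matches the way the indexer above skips values that fail to parse.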

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
