[ https://issues.apache.org/jira/browse/NUTCH-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398296#comment-13398296 ]
Kristof commented on NUTCH-1406:
---------------------------------

Markus, I will provide the patch against trunk. But since I used the metatags-plugin+tutorial.zip provided under NUTCH-809, I need to transfer the adjustments to the trunk files. I have some problems building the classes with ant and will come back to fixing them after the weekend, once I have more time to look into this.

Julien, thanks for the link.

> Metatags-index/-parse plugin: conversion to Solr date format and prevents parsing/indexing of empty tags
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1406
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1406
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>            Reporter: Kristof
>            Priority: Minor
>              Labels: conversion, date
>         Attachments: index-metatags.jar
>
>
> This improvement to the index-metatags plugin (sometimes also referred to as
> the parse-metatags plugin) allows for conversion of selected fields to the
> Solr date format and prevents parsing/indexing of metatags that do not
> contain any content.
> In order to convert the values of selected metatags to Solr date format, you
> must list them in the metatags.convert property in nutch-site.xml. The
> example used is a simple Dublin Core element, dc.date. It must also be
> defined in the metatags.names property.
> {code}
> <property>
>   <name>metatags.convert</name>
>   <value>dc.date</value>
>   <description>For plugin index-metatags: Indicate here the name of the
>   html meta tag that should be converted to date format.
>   </description>
> </property>
> {code}
> I read that SimpleDateFormat is not a robust solution, so this improvement
> might have some problems.
> So far it has worked well for me. More details about the changes follow below.
> Changes made to MetaTagsIndexer.java between lines 41 and 71:
> {code}
> if (tagEntry != null && tagEntry.trim().length() > 0)
> {
>   if (checkDateConversion(metatag)) {
>
>     Date date = null;
>
>     try {
>       date = new SimpleDateFormat("yyyy-MM-dd").parse(tagEntry);
>       doc.add(metatag, date);
>     } catch (ParseException e) {
>       e.printStackTrace();
>
>       if (LOG.isTraceEnabled()) {
>         LOG.trace(url.toString() + " : date conversion failed for " +
>             tagEntry + " in " + metatag + " field");
>       }
>     }
>   }
>   else {
>     doc.add(metatag, tagEntry);
>   }
>
>   if (LOG.isTraceEnabled()) {
>     LOG.trace(url.toString() + " : successfully added " +
>         tagEntry + " to the " + metatag + " field");
>   }
> }
> else {
>
>   if (LOG.isTraceEnabled()) {
>     LOG.trace(url.toString() + " : " + metatag + " and " +
>         tagEntry + " not added as Metatag does not have any content");
>   }
> }
> {code}
> Method added to MetaTagsIndexer.java:
> {code}
> public boolean checkDateConversion (String metatag){
>   String convertToDate = conf.get("metatags.convert", "*");
>   String[] fieldsToConvert = convertToDate.split(";");
>   boolean convert = false;
>
>   for (String check : fieldsToConvert)
>     if (check.equals(metatag)) convert = true;
>
>
>   return convert;
> }
> {code}
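
For reference, the two steps in the quoted patch (deciding whether a metatag is listed in metatags.convert, and turning a yyyy-MM-dd value into a date) can be sketched as standalone Java without any Nutch or Hadoop classes. This is only an illustration under stated assumptions, not the code from the attachment: the class and method names (MetatagDateSketch, shouldConvert, parseStrict, toSolrDate) are hypothetical, trimming the configured entries is an addition, and the value is assumed to be a UTC calendar date. It also shows one way to address the SimpleDateFormat concern: a fresh formatter per call (the class is not thread-safe), lenient parsing switched off, and rejection of trailing characters. Note that in the quoted method the default "*" is compared literally, so it does not act as a wildcard; the sketch keeps that behavior.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper class, not part of Nutch or the index-metatags plugin.
final class MetatagDateSketch {

    // True if metatag appears in the semicolon-separated list, e.g. "dc.date;dc.modified".
    // Entries are compared literally after trimming; "*" is NOT treated as a wildcard.
    static boolean shouldConvert(String configured, String metatag) {
        if (configured == null || metatag == null) {
            return false;
        }
        for (String field : configured.split(";")) {
            if (field.trim().equals(metatag)) {
                return true;
            }
        }
        return false;
    }

    // Parses a yyyy-MM-dd value as midnight UTC (assumption); returns null if it does not parse.
    static Date parseStrict(String value) {
        if (value == null) {
            return null;
        }
        String trimmed = value.trim();
        // New instance per call because SimpleDateFormat is not thread-safe.
        SimpleDateFormat in = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
        in.setLenient(false); // reject out-of-range fields such as month 13
        in.setTimeZone(TimeZone.getTimeZone("UTC"));
        ParsePosition pos = new ParsePosition(0);
        Date date = in.parse(trimmed, pos);
        // Reject input with trailing characters after the date.
        if (date == null || pos.getIndex() != trimmed.length()) {
            return null;
        }
        return date;
    }

    // Formats a date in the ISO 8601 UTC form that Solr date fields use.
    static String toSolrDate(Date date) {
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        out.setTimeZone(TimeZone.getTimeZone("UTC"));
        return out.format(date);
    }

    public static void main(String[] args) {
        if (shouldConvert("dc.date", "dc.date")) {
            Date date = parseStrict("2012-06-21");
            System.out.println(toSolrDate(date)); // prints 2012-06-21T00:00:00Z
        }
    }
}
```

In the indexer itself the Date object would be handed to doc.add as in the quoted patch; the string form is shown here only to make the target format visible.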