[ 
https://issues.apache.org/jira/browse/NUTCH-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522115#comment-14522115
 ] 

Jeff Cocking commented on NUTCH-1559:
-------------------------------------

While investigating this issue, I found that MetaTagsParser.java appears to 
load the metatag values twice.  MetaTagsParser contains code to handle 
metatags that are not handled by Tika. 

The Tika plugin copies all of the Tika metadata into the Nutch metadata. 
TikaParser.java (around line 184):
        // populate Nutch metadata with Tika metadata
        String[] TikaMDNames = tikamd.names();
        for (String tikaMDName : TikaMDNames) {
            if (tikaMDName.equalsIgnoreCase(Metadata.TITLE))
                continue;
            // TODO what if multivalued?
            nutchMetadata.add(tikaMDName, tikamd.get(tikaMDName));
        }

MetaTagsParser is set up to parse both the Tika metadata and the Nutch 
metadata, so a metatag that Tika has already copied over is picked up a 
second time.  This is the reason for the duplicate values.
MetaTagsParser.java (around line 104):
    // check in the metadata first : the tika-parser
    // might have stored the values there already
    for (String mdName : metadata.names()) {
      addIndexedMetatags(metadata, mdName, metadata.getValues(mdName));
    }

    Metadata generalMetaTags = metaTags.getGeneralTags();
    for (String tagName : generalMetaTags.names()) {
      addIndexedMetatags(metadata, tagName, generalMetaTags.getValues(tagName));
    }
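
To illustrate the effect of the two loops above, here is a minimal, 
self-contained sketch.  It does not use the Nutch or Tika classes: the 
multi-valued maps, the class name MetatagDuplicationSketch and the helpers 
add(), collect() and dedupe() are all hypothetical stand-ins, and the 
de-duplication step is only one possible remedy, not a proposed patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the flow described above; no Nutch/Tika classes are used.
class MetatagDuplicationSketch {

    // Appends a value under a name and keeps earlier values (multi-valued map).
    static void add(Map<String, List<String>> md, String name, String value) {
        md.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Mimics the two loops: first the parse metadata, then the general metatags.
    static Map<String, List<String>> collect(Map<String, List<String>> parseMeta,
                                             Map<String, List<String>> generalTags) {
        Map<String, List<String>> indexed = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : parseMeta.entrySet()) {
            for (String value : e.getValue()) {
                add(indexed, e.getKey(), value);
            }
        }
        for (Map.Entry<String, List<String>> e : generalTags.entrySet()) {
            for (String value : e.getValue()) {
                add(indexed, e.getKey(), value);
            }
        }
        return indexed;
    }

    // Assumed remedy for illustration: drop repeated values per name, keeping order.
    static Map<String, List<String>> dedupe(Map<String, List<String>> md) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : md.entrySet()) {
            out.put(e.getKey(), new ArrayList<>(new LinkedHashSet<>(e.getValue())));
        }
        return out;
    }

    public static void main(String[] args) {
        // The metatag as extracted from the page itself.
        Map<String, List<String>> generalTags = new LinkedHashMap<>();
        add(generalTags, "keywords", "nutch");

        // The same metatag, already copied into the parse metadata by the Tika plugin.
        Map<String, List<String>> parseMeta = new LinkedHashMap<>();
        add(parseMeta, "keywords", "nutch");

        Map<String, List<String>> indexed = collect(parseMeta, generalTags);
        System.out.println("collected: " + indexed);
        System.out.println("deduplicated: " + dedupe(indexed));
    }
}
```

In this sketch the value ends up in the collected map twice because both 
sources carry it; whether the real fix belongs in parse-metatags or in 
parse-tika is the open question raised in the issue description.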

> parse-metatags duplicates extracted metatags in combination with parse-tika
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-1559
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1559
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.11
>
>
> If the plugin parse-metatags is used in combination with parse-tika, the 
> extracted metatags (the pairs <name, value>) are duplicated.
> The metatags are found twice in parse.getData().getParseMeta() and in 
> metaTags.getGeneralTags(). Is this necessary? Maybe we should fix parse-tika 
> in this point?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
