[ 
https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kiran resolved NUTCH-1467.
--------------------------

    Resolution: Implemented

I have attached a patch file that solves the problem described above. 
I do not think it is the best way to do it, but it is a temporary solution 
that works for me for now, so I am posting it here. 

For example, if there are two tags with the same name, like this:

<meta name="DC.creator" content="R.L. Ticknor">
<meta name="DC.creator" content="J.E. Long">

The parser (with the patch applied) will save the values separated by commas, 
as (metatag.dc.creator=R.L. Ticknor,J.E. Long). 

Previously, only the second value was saved, because the Java Properties class 
is used to store the names and values, and setting a key that already exists 
replaces its earlier value.
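
The difference between the old and the patched behaviour can be sketched as 
follows. This is only an illustration and not the code from the attached 
patch: the class name MetaTagMergeSketch and the helper addValue are 
hypothetical, and lower-casing the tag name is an assumption based on the 
metatag.dc.creator key shown above.

```java
import java.util.Locale;
import java.util.Properties;

public class MetaTagMergeSketch {

    // Hypothetical helper (not taken from the patch): if the key already has a
    // value, append the new one after a comma instead of replacing it.
    public static void addValue(Properties props, String name, String content) {
        // Assumption: tag names are stored lower-cased, as in "dc.creator".
        String key = name.toLowerCase(Locale.ROOT);
        String existing = props.getProperty(key);
        if (existing == null) {
            props.setProperty(key, content);
        } else {
            props.setProperty(key, existing + "," + content);
        }
    }

    public static void main(String[] args) {
        // Plain setProperty: the second call replaces the first value.
        Properties overwritten = new Properties();
        overwritten.setProperty("dc.creator", "R.L. Ticknor");
        overwritten.setProperty("dc.creator", "J.E. Long");
        System.out.println(overwritten.getProperty("dc.creator"));
        // prints: J.E. Long

        // Appending helper: both values survive, separated by a comma.
        Properties merged = new Properties();
        addValue(merged, "DC.creator", "R.L. Ticknor");
        addValue(merged, "DC.creator", "J.E. Long");
        System.out.println(merged.getProperty("dc.creator"));
        // prints: R.L. Ticknor,J.E. Long
    }
}
```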

The patch applies to the file HTMLMetaProcessor.java in the directory 
($NUTCH_HOME/src/plugin/parse-html/src/java/org/apache/nutch/parse/html). 

It would have been better to save the values as an array instead of a 
comma-separated string, but since Properties is used to store the names and 
values, I thought it best to keep them separated by commas.

Anyone who uses the crawled meta values should split multi-valued fields on 
the comma, for example with split(",").
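
On the consuming side, recovering the individual values might look like the 
sketch below. The class name MetaTagSplitSketch and the method splitValues 
are hypothetical names used only for illustration; trimming whitespace and 
skipping empty entries are assumptions, not something the patch requires.

```java
import java.util.ArrayList;
import java.util.List;

public class MetaTagSplitSketch {

    // Hypothetical helper: turn a comma-joined metatag value back into a list.
    public static List<String> splitValues(String joined) {
        List<String> values = new ArrayList<>();
        if (joined == null || joined.isEmpty()) {
            return values;
        }
        for (String part : joined.split(",")) {
            String trimmed = part.trim();
            // Assumption: empty entries (e.g. from ",,") are not wanted.
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> creators = splitValues("R.L. Ticknor,J.E. Long");
        System.out.println(creators);
        // prints: [R.L. Ticknor, J.E. Long]
    }
}
```

Note that a value which itself contains a comma, such as a name written as 
"Long, J.E.", would be split into two entries under this scheme.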

Please let me know if you have any suggestions. 

                
> nutch 1.5.1 not able to parse mutliValued metatags
> --------------------------------------------------
>
>                 Key: NUTCH-1467
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1467
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.5.1
>            Reporter: kiran
>            Priority: Minor
>         Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an HTML page by following 
> http://wiki.apache.org/nutch/IndexMetatags. It does not work well when 
> there are two metatags with the same name but different contents. 
> Has anyone else encountered this kind of issue? 
> Are there any changes that need to be made to the config files to make it 
> work?
> When there are two tags with the same name and different content, it takes the 
> value of the later tag and saves it rather than creating a multiValue field.
> Many Thanks,

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
