[ https://issues.apache.org/jira/browse/NUTCH-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1585:
----------------------------------------

    Attachment: NUTCH-1585-trunk.patch
                NUTCH-1585-2.x.patch

Patches for trunk and 2.x. Each simply checks whether the tag already exists in the set and adds it only if it does not. I suppose the check could be expensive if the set is huge; however, because of the check, the set stays much smaller than it would otherwise be.

> Ensure duplicate tags do not exist in microformat-reltag tag set.
> -----------------------------------------------------------------
>
>                 Key: NUTCH-1585
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1585
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.6, 2.2
>            Reporter: Lewis John McGibbney
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1585-2.x.patch, NUTCH-1585-trunk.patch
>
>
> A WebPage can have a great many embedded tags and other such markup.
> Creating huge tag lists containing many duplicates is counterproductive
> to the process of parsing and extracting such structure.
> We should add a mechanism to include only a single occurrence of each tag
> in the microformats-reltag parser.
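
For readers without the patches to hand, the check-then-add idea described in the comment might look roughly like the sketch below. This is not the code from the attached patches: the class name RelTagDedupSketch, the method collectUniqueTags, and the choice of LinkedHashSet are assumptions made purely for illustration, as is the skipping of null or empty tags.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not taken from the NUTCH-1585 patches.
class RelTagDedupSketch {

  // Collects tags in first-seen order, keeping each distinct tag once.
  static Set<String> collectUniqueTags(List<String> rawTags) {
    Set<String> tags = new LinkedHashSet<>();
    for (String tag : rawTags) {
      if (tag == null || tag.isEmpty()) {
        continue; // assumption: empty values are ignored; the issue does not mention them
      }
      // Check whether the tag is already present and add it only if it is not.
      // With a hash-based set the lookup is expected constant time, so the check
      // stays cheap even when many tags have been collected.
      if (!tags.contains(tag)) {
        tags.add(tag);
      }
    }
    return tags;
  }

  public static void main(String[] args) {
    List<String> raw = Arrays.asList("nutch", "apache", "nutch", "search", "apache");
    System.out.println(collectUniqueTags(raw)); // prints [nutch, apache, search]
  }
}
```

When the collection is a java.util.Set, calling add alone already ignores duplicates, so the explicit contains check mainly matters if the tags are held in a list instead.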