[ https://issues.apache.org/jira/browse/TIKA-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler updated TIKA-1303: ------------------------------ Fix Version/s: (was: 1.7) 1.6 > Parsing Html page (not well formed) containing two title tags results in > metadata (title) to be overwritten > ----------------------------------------------------------------------------------------------------------- > > Key: TIKA-1303 > URL: https://issues.apache.org/jira/browse/TIKA-1303 > Project: Tika > Issue Type: Bug > Components: metadata, parser > Affects Versions: 1.2, 1.3, 1.4, 1.5 > Reporter: Hassan Akram > Assignee: Ken Krugler > Priority: Minor > Labels: patch > Fix For: 1.6 > > Attachments: HtmlHandler.java, HtmlParserTest.java, TIKA-1303.patch > > > While crawling following web page, we came accross a strange issue where by > title for page was not being extracted accurately: > http://www.samsung.com/us/support/faq/FAQ00052677/61239/SM-C105AZWAATT > This html page is not well formed and contains two title tags (one in head > and one is body): > e.g. "<html><title>Simple > Content</title><body><h1></h1><title>TitleToIgnore</title></body></html>" > Now in this case a simple fix to htmlhandler could make sure that once title > value has been set in metadata, it should not be overridden when another > title tag is subsequently found. > I am submitting fix for this issue as a path for review (1.5) - hoping that > this could be committed to latest? > Can you please review and update kindly. -- This message was sent by Atlassian JIRA (v6.2#6252)