[ https://issues.apache.org/jira/browse/TIKA-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020021#comment-14020021 ]
Lewis John McGibbney commented on TIKA-1303:
--------------------------------------------

Can someone mark this for a 1.6 fix (or 1.7)?

> Parsing Html page (not well formed) containing two title tags results in metadata (title) to be overwritten
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1303
>                 URL: https://issues.apache.org/jira/browse/TIKA-1303
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 1.2, 1.3, 1.4, 1.5
>            Reporter: Hassan Akram
>            Priority: Minor
>              Labels: patch
>             Fix For: 1.5
>
>         Attachments: HtmlHandler.java, HtmlParserTest.java
>
>
> While crawling the following web page, we came across a strange issue whereby
> the title of the page was not being extracted accurately:
> http://www.samsung.com/us/support/faq/FAQ00052677/61239/SM-C105AZWAATT
> This HTML page is not well formed and contains two title tags (one in the head
> and one in the body), e.g.
> "<html><title>Simple Content</title><body><h1></h1><title>TitleToIgnore</title></body></html>"
> In this case a simple fix to HtmlHandler could make sure that once the title
> value has been set in the metadata, it is not overridden when another
> title tag is subsequently found.
> I am submitting a fix for this issue as a patch for review (against 1.5), hoping
> that it could be committed to the latest version.
> Could you please review and update?
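
For illustration, the "first title wins" behaviour described in the report can be sketched with a plain SAX handler. This is not the attached HtmlHandler.java patch and does not use Tika's API: FirstTitleDemo, FirstTitleHandler and firstTitle are hypothetical names, the title field merely stands in for the metadata entry, and the JDK's XML parser is used only because the sample string from the report happens to be well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class FirstTitleDemo {

    /** Hypothetical handler: keeps the first <title> and ignores later ones. */
    static class FirstTitleHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private boolean inTitle = false;
        private String title = null; // stands in for the metadata title entry

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("title".equalsIgnoreCase(qName)) {
                inTitle = true;
                buffer.setLength(0); // start collecting text for this title element
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inTitle) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equalsIgnoreCase(qName)) {
                inTitle = false;
                // Only set the title if none has been recorded yet.
                if (title == null) {
                    title = buffer.toString().trim();
                }
            }
        }

        String getTitle() {
            return title;
        }
    }

    /** Parses the given markup and returns the first title, or null if there is none. */
    static String firstTitle(String xhtml) throws Exception {
        FirstTitleHandler handler = new FirstTitleHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getTitle();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><title>Simple Content</title><body><h1></h1>"
                + "<title>TitleToIgnore</title></body></html>";
        System.out.println(firstTitle(page)); // prints "Simple Content"
    }
}
```

The null check in endElement is the whole fix in this sketch: the second title element is still parsed, but its text never replaces the value already recorded.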