Demo HTML parser gives incorrect summaries when title is repeated as a heading
------------------------------------------------------------------------------
Key: LUCENE-590
URL: http://issues.apache.org/jira/browse/LUCENE-590
Project: Lucene - Java
Type: Bug
Components: Examples
Versions: 2.0.0
Reporter: Curtis d'Entremont
If you have an HTML document where the title is repeated as a heading at the
top of the document, the HTMLParser returns the title as the summary,
discarding everything else that was added to the summary. Instead, it should
keep the rest of the summary and chop off the title at the beginning
(essentially the opposite). I don't see any benefit to repeating the title in
the summary in any case.
In HTMLParser.jj's getSummary():

    String sum = summary.toString().trim();
    String tit = getTitle();
    if (sum.startsWith(tit) || sum.equals(""))
      return tit;
    else
      return sum;

Change it to (* denotes a line that has changed):

    String sum = summary.toString().trim();
    String tit = getTitle();
  * if (sum.startsWith(tit)) // don't repeat title in summary
  *   return sum.substring(tit.length()).trim();
    else
      return sum;
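
To make the difference concrete, here is a minimal standalone sketch that puts the two versions side by side. It is not the actual HTMLParser code: the class name SummaryDemo, the method names currentSummary and proposedSummary, and the sample strings are all hypothetical, and the title and raw summary are passed in as arguments instead of coming from getTitle() and the parser's summary buffer.

```java
public class SummaryDemo {

    // Mirrors the current getSummary() logic: if the summary starts with
    // the title (or is empty), only the title is returned.
    public static String currentSummary(String tit, String rawSummary) {
        String sum = rawSummary.trim();
        if (sum.startsWith(tit) || sum.equals(""))
            return tit;
        else
            return sum;
    }

    // Mirrors the proposed getSummary() logic: a leading copy of the title
    // is stripped and the remainder of the summary is kept.
    public static String proposedSummary(String tit, String rawSummary) {
        String sum = rawSummary.trim();
        if (sum.startsWith(tit)) // don't repeat title in summary
            return sum.substring(tit.length()).trim();
        else
            return sum;
    }

    public static void main(String[] args) {
        // Hypothetical document: the title is repeated as the first heading.
        String title = "Getting Started";
        String text = "Getting Started This guide explains how to index files.";

        System.out.println("current:  " + currentSummary(title, text));
        System.out.println("proposed: " + proposedSummary(title, text));
    }
}
```

With these sample strings the current logic yields only the title, while the proposed logic yields the text that follows it. One side effect worth noting: because the proposed code drops the sum.equals("") check, an empty summary now comes back as an empty string instead of falling back to the title.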