[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12378458 ]

Doug Cutting commented on NUTCH-134:
------------------------------------

+1 for Summary as Writable and change HitSummarizer.getSummary() to return a Summary directly rather than a String. I don't think this has bad performance implications.

> Summarizer doesn't select the best snippets
> -------------------------------------------
>
>          Key: NUTCH-134
>          URL: http://issues.apache.org/jira/browse/NUTCH-134
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.7.2, 0.7.1, 0.7, 0.8-dev
>     Reporter: Andrzej Bialecki
>  Attachments: summarizer.060506.patch
>
> Summarizer.java tries to select the best fragments from the input text, where
> the frequency of query terms is the highest. However, the logic in line 223
> is flawed in that the excerptSet.add() operation will add new excerpts only
> if they are not already present - the test is performed using the Comparator
> that compares only the numUniqueTokens. This means that if there are two or
> more excerpts, which score equally high, only the first of them will be
> retained, and the rest of equally-scoring excerpts will be discarded, in
> favor of other excerpts (possibly lower scoring).
> To fix this the Set should be replaced with a List + a sort operation. To
> keep the relative position of excerpts in the original order the Excerpt
> class should be extended with an "int order" field, and the collected
> excerpts should be sorted in that order prior to adding them to the summary.
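
For illustration, the behaviour described in the issue can be reproduced with a small standalone sketch. This is not the attached patch and not the actual Summarizer.java code. The names Excerpt, numUniqueTokens and order come from the issue text; the class ExcerptSelectionSketch, the text field, and the methods selectWithSet() and selectWithList() are hypothetical, and the sketch assumes a modern Java version rather than the one Nutch targeted at the time.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class ExcerptSelectionSketch {

    /** Hypothetical stand-in for the Summarizer's Excerpt class. */
    static final class Excerpt {
        final String text;          // illustrative payload only
        final int numUniqueTokens;  // score: distinct query terms in the excerpt
        final int order;            // position in the original text (proposed field)

        Excerpt(String text, int numUniqueTokens, int order) {
            this.text = text;
            this.numUniqueTokens = numUniqueTokens;
            this.order = order;
        }
    }

    /** Compares by score only, highest first, as described in the issue. */
    static final Comparator<Excerpt> BY_SCORE_DESC =
        (a, b) -> Integer.compare(b.numUniqueTokens, a.numUniqueTokens);

    /**
     * Flawed approach: a sorted Set treats compare() == 0 as "already present",
     * so only the first of several equally scoring excerpts is kept.
     */
    static List<Excerpt> selectWithSet(List<Excerpt> candidates, int max) {
        TreeSet<Excerpt> excerptSet = new TreeSet<>(BY_SCORE_DESC);
        excerptSet.addAll(candidates);
        List<Excerpt> best = new ArrayList<>();
        for (Excerpt e : excerptSet) {
            if (best.size() == max) break;
            best.add(e);
        }
        return best;
    }

    /**
     * Proposed approach: keep every candidate in a List, sort by score,
     * take the top entries, then restore the original text order.
     */
    static List<Excerpt> selectWithList(List<Excerpt> candidates, int max) {
        List<Excerpt> sorted = new ArrayList<>(candidates);
        sorted.sort(BY_SCORE_DESC); // List.sort is stable, so ties keep input order
        List<Excerpt> best =
            new ArrayList<>(sorted.subList(0, Math.min(max, sorted.size())));
        best.sort(Comparator.comparingInt((Excerpt e) -> e.order));
        return best;
    }

    public static void main(String[] args) {
        List<Excerpt> candidates = new ArrayList<>();
        candidates.add(new Excerpt("A", 2, 0));
        candidates.add(new Excerpt("B", 1, 1));
        candidates.add(new Excerpt("C", 2, 2));
        candidates.add(new Excerpt("D", 2, 3));

        for (Excerpt e : selectWithSet(candidates, 2)) {
            System.out.println("set-based:  " + e.text + " score=" + e.numUniqueTokens);
        }
        for (Excerpt e : selectWithList(candidates, 2)) {
            System.out.println("list-based: " + e.text + " score=" + e.numUniqueTokens);
        }
    }
}
```

With four candidates scoring 2, 1, 2, 2 and a limit of two, the Set-based selection keeps A and the lower-scoring B, because C and D compare equal to A and are never added. The List-based selection keeps A and C, in their original order.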