[ http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12359649 ]
byron miller commented on NUTCH-134:
------------------------------------

I would take more CPU for better summaries any day :) CPU power is cheaper
than manual intervention! If any testing is needed, don't hesitate to drop
me a patch. I've been working on a 500 million page index using the mapred
branch on a 10-node cluster, so I have plenty of numbers to test against.

> Summarizer doesn't select the best snippets
> -------------------------------------------
>
>          Key: NUTCH-134
>          URL: http://issues.apache.org/jira/browse/NUTCH-134
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.7, 0.8-dev, 0.7.1, 0.7.2-dev
>     Reporter: Andrzej Bialecki
>
> Summarizer.java tries to select the best fragments from the input text, where
> the frequency of query terms is the highest. However, the logic in line 223
> is flawed in that the excerptSet.add() operation will add new excerpts only
> if they are not already present - the test is performed using the Comparator
> that compares only the numUniqueTokens. This means that if there are two or
> more excerpts, which score equally high, only the first of them will be
> retained, and the rest of equally-scoring excerpts will be discarded, in
> favor of other excerpts (possibly lower scoring).
> To fix this the Set should be replaced with a List + a sort operation. To
> keep the relative position of excerpts in the original order the Excerpt
> class should be extended with an "int order" field, and the collected
> excerpts should be sorted in that order prior to adding them to the summary.
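
For anyone who wants to see the behaviour described in the issue without digging through Summarizer.java, here is a minimal, self-contained sketch. It is not the actual Nutch code: the Excerpt class below is a hypothetical stand-in that keeps only the numUniqueTokens field named in the issue plus the proposed "int order" field, and the class and method names (ExcerptSelectionSketch, selectWithSet, selectWithList) and the sample data are made up for illustration. It assumes that a sorted set with a score-only Comparator is a fair model of the excerptSet.add() problem.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class ExcerptSelectionSketch {

    /** Hypothetical stand-in for the Summarizer's Excerpt; not the real class. */
    static final class Excerpt {
        final String text;
        final int numUniqueTokens; // score: distinct query terms in the fragment
        final int order;           // proposed field: position in the original text

        Excerpt(String text, int numUniqueTokens, int order) {
            this.text = text;
            this.numUniqueTokens = numUniqueTokens;
            this.order = order;
        }
    }

    /** Compares by score only, as the issue says the existing Comparator does. */
    static final Comparator<Excerpt> BY_SCORE =
        Comparator.comparingInt((Excerpt e) -> e.numUniqueTokens);

    /** Models the reported problem: ties look like duplicates to the set. */
    static List<String> selectWithSet(List<Excerpt> candidates, int max) {
        TreeSet<Excerpt> excerptSet = new TreeSet<>(BY_SCORE);
        for (Excerpt e : candidates) {
            excerptSet.add(e); // returns false (and drops e) when the score is already present
        }
        List<String> out = new ArrayList<>();
        for (Excerpt e : excerptSet.descendingSet()) { // highest score first
            if (out.size() == max) break;
            out.add(e.text);
        }
        return out;
    }

    /** Models the proposed fix: List + sort by score, then restore text order. */
    static List<String> selectWithList(List<Excerpt> candidates, int max) {
        List<Excerpt> byScore = new ArrayList<>(candidates);
        byScore.sort(BY_SCORE.reversed()); // stable sort, so ties keep their input order
        List<Excerpt> best =
            new ArrayList<>(byScore.subList(0, Math.min(max, byScore.size())));
        best.sort(Comparator.comparingInt((Excerpt e) -> e.order)); // original order
        List<String> out = new ArrayList<>();
        for (Excerpt e : best) {
            out.add(e.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Excerpt> candidates = Arrays.asList(
            new Excerpt("A", 2, 0),
            new Excerpt("B", 3, 1),
            new Excerpt("C", 3, 2), // ties with B
            new Excerpt("D", 1, 3));
        System.out.println(selectWithSet(candidates, 3));  // [B, A, D]
        System.out.println(selectWithList(candidates, 3)); // [A, B, C]
    }
}
```

With this sample data the set-based version drops C, which ties with B on score, and falls back to the lower-scoring D. The list-based version keeps both top-scoring excerpts and returns them in the order they occur in the text.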