[ https://issues.apache.org/jira/browse/NUTCH-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939336#comment-16939336 ]
Sebastian Nagel commented on NUTCH-2381:
----------------------------------------

Good point: "A particular iteration order is not specified for {{HashMap}} objects - any code that depends on iteration order should be fixed." (https://docs.oracle.com/javase/8/docs/technotes/guides/collections/changes8.html)

Will provide a fix.

CAVEAT: while the fix makes TextProfileSignature more reliable, it will change the signatures in an already existing CrawlDb.

> In some situations the class TextProfileSignature gives different signatures
> for the same text "profile" page.
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2381
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2381
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.13
>            Reporter: Rodrigo Joni Sestari
>            Priority: Major
>              Labels: signature
>             Fix For: 1.16
>
>
> In some situations the class TextProfileSignature gives different signatures
> for the same text "profile" page.
> The method TextProfileSignature.calculate uses a HashMap to store the tokens;
> after some processing, the tokens are sorted by decreasing frequency.
> For some pages, such as "http://curia.europa.eu/jcms/", the text "profile" is
> the same but the signature differs from one fetch to the next.
> This happens because the tokens are sorted only by decreasing frequency.
> Tokens with the same frequency may not come out in the same order in
> different fetches.
> HashMap makes no guarantees as to the order of the map, and does not
> guarantee that the order will remain constant over time.
> My suggestion is to change the method TokenComparator.compare so that it
> sorts by frequency and then by name.
> Rodrigo
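
The tie-breaking suggested in the issue could look roughly like the sketch below. It is a minimal, self-contained illustration and not the Nutch code or the eventual patch: the class TieBreakSketch, the nested Token class with its fields val and cnt, and the method profile are all hypothetical names chosen for this example. It sorts tokens by decreasing frequency and breaks ties by token text, so that the resulting order no longer depends on the iteration order of the map the tokens came from.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TieBreakSketch {

    // Hypothetical stand-in for a token with its frequency;
    // the field names are assumptions made for this example.
    static class Token {
        final String val;
        final int cnt;

        Token(String val, int cnt) {
            this.val = val;
            this.cnt = cnt;
        }
    }

    // Decreasing frequency first; equal frequencies are ordered by token text.
    static final Comparator<Token> BY_FREQ_THEN_NAME = (a, b) -> {
        int byCount = Integer.compare(b.cnt, a.cnt);
        return byCount != 0 ? byCount : a.val.compareTo(b.val);
    };

    // Returns the token texts in a deterministic order, whatever the
    // iteration order of the input map happens to be.
    static List<String> profile(Map<String, Integer> counts) {
        List<Token> tokens = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            tokens.add(new Token(e.getKey(), e.getValue()));
        }
        tokens.sort(BY_FREQ_THEN_NAME);
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(t.val);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two maps with the same content but different iteration orders
        // (LinkedHashMap iterates in insertion order).
        Map<String, Integer> first = new LinkedHashMap<>();
        first.put("fetch", 2);
        first.put("crawl", 2);
        first.put("nutch", 3);
        first.put("apache", 2);

        Map<String, Integer> second = new LinkedHashMap<>();
        second.put("apache", 2);
        second.put("nutch", 3);
        second.put("crawl", 2);
        second.put("fetch", 2);

        System.out.println(profile(first));  // [nutch, apache, crawl, fetch]
        System.out.println(profile(second)); // [nutch, apache, crawl, fetch]
    }
}
```

A comparator that only compares frequencies leaves the relative order of equally frequent tokens to whatever order they were collected in; adding the secondary comparison on the token text removes that dependency. As noted in the caveat above, any change to the ordering also changes the signatures computed for existing pages.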