[ https://issues.apache.org/jira/browse/NUTCH-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2230:
-----------------------------------
    Fix Version/s: 2.5

> Nutch doesn't index all URLs found
> ----------------------------------
>
>                 Key: NUTCH-2230
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2230
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 2.3.1
>         Environment: MongoDB with WiredTiger storage engine (3.2 but probably affects other versions as well)
>            Reporter: Aaron Cosand
>            Priority: Major
>             Fix For: 2.5
>
>
> The initial query run by the generator task against MongoDB doesn't force
> ordering by _id. This causes an incorrect selection of ranges for the
> successive map-reduce related queries. The successive queries do appear to
> run in the correct order, since _id is always indexed, but they should also
> explicitly specify a sort, since a particular order is not guaranteed
> otherwise. I didn't dig deep enough to see whether the root of the problem
> is in Nutch or Gora, or whether it only affects MongoDB or could affect
> other databases as well.
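
The failure mode described in the report can be illustrated with a toy simulation. This is not Nutch or Gora code: the report does not say how the generator derives its ranges, so the "take every Nth _id as a boundary" scheme, the function names, and the sample ids below are all assumptions made for illustration only.

```python
# Toy model: an initial query returns ids, split boundaries are picked from
# that result, and follow-up range queries fetch lo <= id < hi.
# All names here are hypothetical; real Nutch/Gora split logic may differ.

def pick_boundaries(ids, chunk_size):
    """Take every chunk_size-th id of the initial result as a boundary."""
    return [ids[i] for i in range(0, len(ids), chunk_size)]

def range_query(all_ids, lo, hi):
    """Simulated follow-up query: lo <= id < hi, with hi=None open-ended."""
    return sorted(i for i in all_ids if i >= lo and (hi is None or i < hi))

def covered_ids(initial_result, chunk_size):
    """Run one range query per pair of adjacent boundaries."""
    bounds = pick_boundaries(initial_result, chunk_size)
    seen = []
    for k, lo in enumerate(bounds):
        hi = bounds[k + 1] if k + 1 < len(bounds) else None
        seen.extend(range_query(initial_result, lo, hi))
    return seen

# Ids as they might come back when no sort is requested (arbitrary order).
natural_order = [5, 1, 9, 3, 7, 2, 8, 4, 0, 6]

unsorted_cover = covered_ids(natural_order, 3)
sorted_cover = covered_ids(sorted(natural_order), 3)

missing = sorted(set(natural_order) - set(unsorted_cover))
print("missing without sort:", missing)   # [0, 1, 2]
print("covered with sort:", sorted_cover)
```

With the unsorted initial result the boundaries are not monotonic, so some ranges are empty, some ids are fetched twice, and others are never fetched. Sorting the initial result by id makes every id fall into exactly one range.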