[ https://issues.apache.org/jira/browse/NUTCH-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2379:
-----------------------------------
    Fix Version/s: 1.17

> crawl script dedup's crawldb update is slow
> --------------------------------------------
>
>                 Key: NUTCH-2379
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2379
>             Project: Nutch
>          Issue Type: Bug
>          Components: bin
>    Affects Versions: 1.11
>         Environment: shell
>            Reporter: Michael Coffey
>            Priority: Minor
>             Fix For: 1.17
>
>
> In the standard crawl script, there is a _bin_nutch updatedb command and,
> soon after that, a _bin_nutch dedup command. Both of them launch hadoop jobs
> with "crawldb /path/to/crawl/db" in their names (in addition to the actual
> deduplication job).
> In my situation, the "crawldb" job launched by dedup takes twice as long as
> the one launched by updatedb.
> I notice that the script passes $commonOptions to updatedb but not to dedup.
> I suspect that the crawldb update launched by dedup may not be compressing
> its output.
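
To make the observation about $commonOptions concrete, here is a minimal sketch of the two invocations and of one possible change. It is not the actual crawl script: the `__bin_nutch` function below is a hypothetical stand-in that only prints the command line, and the values of `CRAWL_PATH`, `SEGMENT`, and `commonOptions` are assumptions chosen for illustration. Whether passing the options to dedup removes the slowdown is the reporter's suspicion, not something confirmed here.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the crawl script's helper: it prints the
# command line instead of launching a Hadoop job.
__bin_nutch() {
  echo "bin/nutch $*"
}

# Assumed values, for illustration only.
CRAWL_PATH="/path/to/crawl"
SEGMENT="20200101000000"
commonOptions="-D mapreduce.map.output.compress=true"

# updatedb receives the shared Hadoop options.
# $commonOptions is left unquoted on purpose so it splits into separate arguments.
__bin_nutch updatedb $commonOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments/$SEGMENT

# Reported behavior: dedup is invoked without the shared options.
__bin_nutch dedup "$CRAWL_PATH"/crawldb

# Possible change: pass the same options to dedup as well.
__bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb
```

Replacing the stub with the real helper and comparing the run time of the dedup-launched "crawldb" job before and after the change would show whether the missing options explain the difference.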