Sebastian Nagel created NUTCH-2143:
--------------------------------------

             Summary: GeneratorJob ignores batch id passed as argument
                 Key: NUTCH-2143
                 URL: https://issues.apache.org/jira/browse/NUTCH-2143
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 2.3.1
            Reporter: Sebastian Nagel
            Priority: Blocker
             Fix For: 2.3.1

The batch id passed to GeneratorJob via the option/argument -batchId <id> is ignored; a newly generated batch id is used to mark the current batch instead. Log snippets from a run of bin/crawl:

{noformat}
bin/nutch generate ... -batchId 1444941073-14208
...
GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs
Fetching :
bin/nutch fetch ... 1444941073-14208 ...
...
QueueFeeder finished: total 0 records. Hit by time limit :0
{noformat}

The generated URLs are marked with the wrong batch id:

{noformat}
hbase(main):010:0> scan 'test_webpage'
ROW                     COLUMN+CELL
 org.apache.nutch:http/ column=f:bid, timestamp=1444941077080, value=1444941074-858443668
...
 org.apache.nutch:http/ column=mk:_gnmrk_, timestamp=1444941077080, value=1444941074-858443668
{noformat}

Because the fetcher is called with the batch id that was passed on the command line (1444941073-14208), it finds no matching rows and does not fetch anything.

This problem was reported by Sherban Drulea [[1|https://www.mail-archive.com/user@nutch.apache.org/msg13894.html], [2|https://www.mail-archive.com/user@nutch.apache.org/msg13912.html]].
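
For illustration, the expected behaviour can be sketched roughly as follows. This is a minimal Python model and not the Nutch code (which is Java). The names parse_batch_id, generate_batch_id, resolve_batch_id and select_for_fetch are hypothetical, the generated id format is only modelled on the ids in the log above, and the dictionary stands in for the batch-id marks stored per row.

```python
import random
import time
from typing import Dict, List, Optional, Sequence


def parse_batch_id(args: Sequence[str]) -> Optional[str]:
    """Return the value following -batchId, or None if the option is absent."""
    for i, arg in enumerate(args):
        if arg == "-batchId":
            if i + 1 >= len(args):
                raise ValueError("-batchId requires a value")
            return args[i + 1]
    return None


def generate_batch_id() -> str:
    """Assumed fallback format: <epoch seconds>-<random non-negative int>."""
    return f"{int(time.time())}-{random.randint(0, 2**31 - 1)}"


def resolve_batch_id(args: Sequence[str]) -> str:
    """Prefer the batch id supplied by the caller; generate one only if none was passed."""
    supplied = parse_batch_id(args)
    return supplied if supplied is not None else generate_batch_id()


def select_for_fetch(marks: Dict[str, str], batch_id: str) -> List[str]:
    """Model of the fetcher's selection: only rows marked with batch_id are picked up."""
    return [row for row, mark in marks.items() if mark == batch_id]


if __name__ == "__main__":
    passed = "1444941073-14208"

    # Behaviour reported in this issue: the row carries a generated id,
    # so selecting by the passed id yields nothing.
    buggy_marks = {"org.apache.nutch:http/": "1444941074-858443668"}
    print(select_for_fetch(buggy_marks, passed))

    # Expected behaviour: the generator marks the row with the passed id.
    batch_id = resolve_batch_id(["-batchId", passed])
    fixed_marks = {"org.apache.nutch:http/": batch_id}
    print(select_for_fetch(fixed_marks, passed))
```

The point of the sketch is only that the id used for marking and the id used for selection must be the same value. A freshly generated id is appropriate only when the caller passed none.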