[ https://issues.apache.org/jira/browse/NUTCH-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2143:
-----------------------------------
    Description:
The batch id passed to GeneratorJob by option/argument -batchId <id> is ignored; a newly generated batch id is used instead to mark the current batch. Log snippets from a run of bin/crawl:
{noformat}
bin/nutch generate ... -batchId 1444941073-14208
...
GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs
Fetching :
bin/nutch fetch ... 1444941073-14208 ...
...
QueueFeeder finished: total 0 records. Hit by time limit :0
{noformat}
The generated URLs are marked with the wrong batch id:
{noformat}
hbase(main):010:0> scan 'test_webpage'
ROW                     COLUMN+CELL
 org.apache.nutch:http/ column=f:bid, timestamp=1444941077080, value=1444941074-858443668
...
 org.apache.nutch:http/ column=mk:_gnmrk_, timestamp=1444941077080, value=1444941074-858443668
{noformat}
so the fetcher, which is started with the original batch id, finds no matching URLs and fetches nothing.

This problem was reported by Sherban Drulea [[1|https://www.mail-archive.com/user@nutch.apache.org/msg13894.html]], [[2|https://www.mail-archive.com/user@nutch.apache.org/msg13912.html]].

> GeneratorJob ignores batch id passed as argument
> ------------------------------------------------
>
>                 Key: NUTCH-2143
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2143
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 2.3.1
>            Reporter: Sebastian Nagel
>            Priority: Blocker
>             Fix For: 2.3.1
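To make the expected behaviour concrete, here is a minimal Java sketch of how a job could resolve its batch id: use the value given with -batchId when present, and generate one only as a fallback. This is not Nutch's actual code. The class BatchIdArgs and its methods are hypothetical names chosen for illustration, and the fallback format (epoch seconds, a dash, a non-negative random integer) is only an assumption based on the shape of the ids in the log above.

```java
import java.util.Random;

// Hypothetical helper, not part of Nutch: illustrates honoring -batchId.
final class BatchIdArgs {

    /** Returns the value following "-batchId", or null if the option is absent. */
    static String parseBatchId(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-batchId".equals(args[i])) {
                return args[i + 1];
            }
        }
        return null;
    }

    /** Fallback id of the assumed form "<epoch seconds>-<non-negative random int>". */
    static String generateBatchId(long nowMillis, Random rnd) {
        return (nowMillis / 1000) + "-" + (rnd.nextInt() & Integer.MAX_VALUE);
    }

    /** The requested id wins; a generated id is used only when none was passed. */
    static String resolveBatchId(String[] args, long nowMillis, Random rnd) {
        String requested = parseBatchId(args);
        return requested != null ? requested : generateBatchId(nowMillis, rnd);
    }

    public static void main(String[] args) {
        String[] generateArgs = {"-batchId", "1444941073-14208"};
        String batchId = resolveBatchId(generateArgs, System.currentTimeMillis(), new Random());
        // Prints the id that was passed in: 1444941073-14208
        System.out.println("batch id used to mark URLs: " + batchId);
    }
}
```

With this ordering, the id written to the generated rows is the same one later handed to the fetch step, so the fetcher's lookup by batch id would match.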