SegmentReader -get shows wrong data
-----------------------------------

                 Key: NUTCH-1252
                 URL: https://issues.apache.org/jira/browse/NUTCH-1252
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.4, 1.5
            Reporter: Sebastian Nagel
         Attachments: NUTCH-1252.patch

The -get option of the SegmentReader (nutch readseg -get) may show data that
belongs to a different URL than the one given.

To reproduce:
{code}
% mkdir -p test_readseg/urls
% echo -e "http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0" > test_readseg/urls/seeds

% nutch inject test_readseg/crawldb test_readseg/urls
Injector: starting at 2012-01-18 09:32:25
Injector: crawlDb: test_readseg/crawldb
Injector: urlDir: test_readseg/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-01-18 09:32:28, elapsed: 00:00:03

% nutch generate test_readseg/crawldb test_readseg/segments/
Generator: starting at 2012-01-18 09:32:30
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: test_readseg/segments/20120118093232
Generator: finished at 2012-01-18 09:32:34, elapsed: 00:00:03

% nutch readseg -get test_readseg/segments/* 'http://nutch.apache.org/' -nocontent -noparse -nofetch -noparsedata -noparsetext
SegmentReader: get 'http://nutch.apache.org/'
Crawl Generate::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Jan 18 09:32:26 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 10.0
Signature: null
Metadata: _ngt_: 1326875550401test: AbcTest
{code}

The score (10.0) and the metadata (test: AbcTest) are the values injected for
http://abc.test/, so the CrawlDatum shown is the wrong one: it belongs to
http://abc.test/, not to the requested http://nutch.apache.org/.
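
For comparison, here is a minimal sketch that looks up what the seed file itself says about each URL, so the expected metadata can be checked against the readseg output. The function name seed_meta and the temporary file are assumptions made for illustration; they are not part of Nutch. printf is used instead of echo -e only for portability.

```shell
# Hypothetical helper (not part of Nutch): print the seed-file metadata
# fields for one URL. $1 = seed file, $2 = URL.
# Seed lines are tab-separated with the URL in the first field.
seed_meta() {
  awk -F '\t' -v url="$2" '$1 == url { for (i = 2; i <= NF; i++) print $i }' "$1"
}

# Recreate the two seed lines from the report in a temporary file.
seeds=$(mktemp)
printf 'http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0\n' > "$seeds"

# Metadata injected for the URL that was requested with -get.
seed_meta "$seeds" 'http://nutch.apache.org/'

# Metadata injected for the other URL; this is what -get actually returned.
seed_meta "$seeds" 'http://abc.test/'
```

The first call prints only test=ApacheNutch, while the second prints test=AbcTest and nutch.score=10.0, which are the values that appear in the output above.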

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
