[ 
https://issues.apache.org/jira/browse/NUTCH-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1930:
----------------------------------------
    Fix Version/s:     (was: 2.3.1)
                   2.4

> Fetcher erases Markers for certain URLs / documents
> ---------------------------------------------------
>
>                 Key: NUTCH-1930
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1930
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 2.3
>            Reporter: Michiel
>             Fix For: 2.4
>
>
> During an active crawling project, I noticed what appears to be a bug in the 
> fetcher: the markers for certain pages (PDFs especially) are either not 
> saved, or erased altogether. The pages are thus not parsed, nor updated in 
> the DB. They keep appearing in the generate lists and fetch lists. Note that 
> this is a separate issue from NUTCH-1922. That one involves correctly parsed 
> pages. This bug prevents certain pages from getting correct markers set.
> Although I'm still new to Nutch and no Java expert, I'm currently trying to 
> debug this. Because the error seems rather easy to replicate, it seemed 
> sensible to share my findings so far. If I find out more myself, I'll 
> update this issue.
> For this test, I injected two test URLs which never seemed to get parsed, 
> even though they are valid documents which are not excluded by any filters. I 
> use an http.content.limit of 64 MB, and Tika is used for parsing documents. 
> Note that these are just two examples; I can provide more if needed.
> - http://www.aanvalopschooluitval.nl/userfiles/file/projectenbank/Flex%20Lectoraat.pdf
> - http://www.prettywoman-utrecht.nl/wp-content/uploads/PrettyWoman-methodiek_web.pdf
> Steps:
> 1) Whenever a batch gets generated, the GENERATE_MARK is set. So far so good.
> 2) During fetch, map() inside FetcherJob checks if this GENERATE_MARK is set. 
> If so, it continues. Still, so far so good.
> 3) After fetch, output() inside FetcherReducer sets the FETCH_MARK. I've 
> logged the marker, and it gets set with the correct batchId. It gets a value.
> 4) However, when another Nutch command is run, all the markers for these 
> example URLs appear to have been erased. Not only is FETCH_MARK suddenly not 
> set, GENERATE_MARK is also erased. Thus, the parser will think the URL hasn't 
> been fetched yet. The fetchStatus, however, is nicely set to "2 
> (status_fetched)". It's just the markers that are not correctly set.
> My first assumption was that FETCH_MARK was not saved. However, as noted in 
> step 3), it gets the correct value. Also, GENERATE_MARK is erased after the 
> process is complete, so something else goes wrong. Somewhere before the end 
> of FetcherJob, the markers for certain pages are erased. Note that all other 
> values, like content, baseUrl, fetchtimes and fetchStatus, are saved 
> correctly for these URLs.
> Finally, for testing purposes, here is an example URL that DOES work: 
> http://www.aanvalopschooluitval.nl/userfiles/file/2011/Plusvoorzieningenkrant.pdf
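
The marker flow described in steps 1) to 4) can be sketched as a small, self-contained toy model. This is only an illustration under stated assumptions: the `Page` class, the string marker keys, and the `looksLikeReportedBug` helper are hypothetical stand-ins, not Nutch's own storage classes or fetcher code. It shows the expected state after generate and fetch, and a check for the inconsistent state reported above (fetchStatus 2 with both markers missing).

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the marker flow; these are NOT Nutch's own classes. */
class MarkerFlowSketch {
    // Hypothetical keys standing in for GENERATE_MARK and FETCH_MARK.
    static final String GENERATE_MARK = "GENERATE_MARK";
    static final String FETCH_MARK = "FETCH_MARK";
    // The report shows fetchStatus "2 (status_fetched)".
    static final int STATUS_FETCHED = 2;

    /** Minimal stand-in for a stored page row (assumption, for illustration). */
    static class Page {
        final Map<String, String> markers = new HashMap<>();
        int fetchStatus = 0;
    }

    /** Step 1: generating a batch sets the generate marker to the batch id. */
    static void generate(Page page, String batchId) {
        page.markers.put(GENERATE_MARK, batchId);
    }

    /** Steps 2 and 3: fetch only if the generate marker is set, then set the fetch marker. */
    static boolean fetch(Page page) {
        String batchId = page.markers.get(GENERATE_MARK);
        if (batchId == null) {
            return false; // not part of a generated batch, so it is skipped
        }
        page.fetchStatus = STATUS_FETCHED;
        page.markers.put(FETCH_MARK, batchId);
        return true;
    }

    /** Step 4: the reported symptom is a fetched status with both markers gone. */
    static boolean looksLikeReportedBug(Page page) {
        return page.fetchStatus == STATUS_FETCHED
                && !page.markers.containsKey(GENERATE_MARK)
                && !page.markers.containsKey(FETCH_MARK);
    }

    public static void main(String[] args) {
        Page healthy = new Page();
        generate(healthy, "batch-1");
        fetch(healthy);
        System.out.println("healthy page flagged: " + looksLikeReportedBug(healthy));

        Page affected = new Page();
        generate(affected, "batch-1");
        fetch(affected);
        affected.markers.clear(); // simulate the unexplained erasure of all markers
        System.out.println("affected page flagged: " + looksLikeReportedBug(affected));
    }
}
```

A check of this shape, run against real rows after the fetch step, might help narrow down which URLs lose their markers; where the erasure actually happens in the fetcher is not established by this sketch.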



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
