[ 
https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615063#comment-13615063
 ] 

Roland von Herget commented on NUTCH-1538:
------------------------------------------

Hi lufeng,

After reading a bit more of the Nutch code, the question arises whether it is 
really necessary to load any of these ParserJob.FIELDS at all.
Shouldn't the fetcher set up all fields (all of "fit.page") necessary for the 
parser during the fetch?
I think I will give this a try here.
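
To make the idea concrete, here is a minimal, self-contained sketch of the two 
field-selection strategies. It is not the actual Nutch code: the Field enum, 
the contents of both field sets and the method names are made up for 
illustration. Only the general behaviour (parser fields get added to the 
fetcher's fields when fetcher.parse=true) is taken from the issue description.

```java
import java.util.EnumSet;
import java.util.Set;

final class FieldSelectionSketch {
    // Hypothetical stand-in for the web page fields; not the real Nutch enum.
    enum Field { BASE_URL, STATUS, FETCH_TIME, CONTENT, CONTENT_TYPE, PARSE_STATUS }

    // Assumed field sets, chosen only to illustrate the problem.
    static final Set<Field> FETCHER_FIELDS =
        EnumSet.of(Field.BASE_URL, Field.STATUS, Field.FETCH_TIME);
    static final Set<Field> PARSER_FIELDS =
        EnumSet.of(Field.BASE_URL, Field.CONTENT, Field.CONTENT_TYPE, Field.PARSE_STATUS);

    // Behaviour described in the issue: with parse=true the parser fields
    // (including CONTENT) are added to the fields loaded at start-up.
    static Set<Field> currentFields(boolean parse) {
        Set<Field> fields = EnumSet.copyOf(FETCHER_FIELDS);
        if (parse) {
            fields.addAll(PARSER_FIELDS);
        }
        return fields;
    }

    // Idea from this comment: load only the fetcher's own fields and let the
    // fetcher fill in whatever the parser needs while fetching.
    static Set<Field> proposedFields(boolean parse) {
        return EnumSet.copyOf(FETCHER_FIELDS);
    }

    public static void main(String[] args) {
        System.out.println(currentFields(true).contains(Field.CONTENT));  // true
        System.out.println(proposedFields(true).contains(Field.CONTENT)); // false
    }
}
```

Whether the parser really gets everything it needs from the fetch step alone 
is exactly the open question above, so this only shows the intended effect on 
the set of loaded fields.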

                
> tuning of loaded fields during fetcherJob start-up
> --------------------------------------------------
>
>                 Key: NUTCH-1538
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1538
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 2.1
>         Environment: nutch 2.1 / cassandra 1.2.1 / gora-cassandra 0.2 / 
> gora-core 0.2.1 
> running fetch with parse=true
>            Reporter: Roland von Herget
>
> The main problem is that Nutch loads nearly every row and column from the DB 
> during start-up of a fetcherJob when fetcher.parse=true.
> A parserJob needs, e.g., the CONTENT field from the DB in order to parse.
> The fetcherJob adds all fields of the parserJob to its needed fields if 
> running with fetcher.parse=true. [FetcherJob.getFields()]
> If the Nutch configuration saves all fetched data to the DB 
> (fetcher.store.content=true), you'll end up loading GBs of unused content 
> during fetcherJob start-up.

