[jira] [Commented] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up

2013-03-31 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13618536#comment-13618536
 ] 

lufeng commented on NUTCH-1538:
---

Hi Roland,
yes, I mean that a 3rd-party plugin may use these fields, not only the 
content field. 

Yes, if all generated URLs have already been crawled, reading this content may 
actually take up a lot of time, but I'm also not sure what the side effects are 
if we comment out this code. I see that the ParserJob#getFields method loads the 
parsePluginFields, htmlParsePluginFields and signaturePluginFields, which is why 
I said that a 3rd-party plugin may load some fields in WebPage. I'll probably 
run a test. Does anyone else have any comments? :)
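
To make the concern about plugin fields concrete, here is a minimal, self-contained sketch of how a job's field set can grow as the union of its own fields and whatever each plugin declares. The `Field` enum and the `FieldAwarePlugin` interface are hypothetical stand-ins for `WebPage.Field` and Nutch's plugin classes, not the real API.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not Nutch code: shows how plugin-declared fields
// widen the set of columns a job asks the store to load.
class PluginFieldUnion {

    // Stand-in for WebPage.Field; the names are illustrative only.
    enum Field { BASE_URL, STATUS, CONTENT, CONTENT_TYPE, SIGNATURE, METADATA }

    // Stand-in for a plugin that declares the fields it needs.
    interface FieldAwarePlugin {
        Collection<Field> getFields();
    }

    // Union of the job's own fields and every plugin's declared fields.
    static Set<Field> requiredFields(Set<Field> jobFields, List<FieldAwarePlugin> plugins) {
        Set<Field> fields = EnumSet.noneOf(Field.class);
        fields.addAll(jobFields);
        for (FieldAwarePlugin plugin : plugins) {
            fields.addAll(plugin.getFields());
        }
        return fields;
    }

    public static void main(String[] args) {
        // A user-written plugin that reads METADATA and SIGNATURE.
        FieldAwarePlugin thirdParty = () -> EnumSet.of(Field.METADATA, Field.SIGNATURE);
        Set<Field> fields = requiredFields(
                EnumSet.of(Field.STATUS, Field.CONTENT), Arrays.asList(thirdParty));
        System.out.println(fields);
    }
}
```

If plugins declare their fields this way, dropping the parser's field set wholesale would also drop whatever a 3rd-party plugin asked for, which is the side effect that still needs testing.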

> tuning of loaded fields during fetcherJob start-up
> --
>
> Key: NUTCH-1538
> URL: https://issues.apache.org/jira/browse/NUTCH-1538
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 2.1
> Environment: nutch 2.1 / cassandra 1.2.1 / gora-cassandra 0.2 / 
> gora-core 0.2.1 
> running fetch with parse=true
> Reporter: Roland von Herget
> Attachments: NUTCH-1538-FetcherJob-v1.patch
>
>
> The main problem is that Nutch loads nearly every row and column from the DB 
> during start-up of a fetcherJob when fetcher.parse=true.
> A parserJob needs, e.g., the CONTENT field from the DB in order to parse.
> The fetcherJob adds all fields of the parserJob to its needed fields if 
> running with fetcher.parse=true. [FetcherJob.getFields()]
> If the Nutch configuration saves all fetched data to the DB 
> (fetcher.store.content=true), you'll end up loading GBs of unused content 
> during fetcherJob start-up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up

2013-03-28 Thread Roland von Herget (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616621#comment-13616621
 ] 

Roland von Herget commented on NUTCH-1538:
--

Hi Lufeng,

I'm not sure if I understood your point correctly, but if you mean that some 
3rd party plugin may use these fields:

1) In a normal workflow it would be like this:
- fetcher startup
- fetcher gets content via http and stores it to DB
- fetcher shutdown
- parser startup
- parser loads content from DB, parses, stores parsed data in DB
- parser shutdown

2) In this discussed workflow (original code):
- fetcher startup
- fetcher loads content from DB
- fetcher gets _new_ content via http (overwriting loaded content from DB)
- fetcher runs parser and stores all in DB
- fetcher shutdown

With my patch, we only touch workflow 2), skipping its second step, "fetcher 
loads content from DB".
Every field we load in that second step should be overwritten by the third 
step; if it were not, workflow 1) couldn't work either.

I know this is not backed by complete knowledge of the code, but from a 
logical point of view it makes sense to me ;)
Just my 2 cents.
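
The argument above can be sketched with a toy model. The `Page` class and the map-backed "DB" are hypothetical and are not Gora or Nutch code. In a parsing fetch, the content read from the store at start-up is replaced by the freshly fetched content before the parser sees it, so the parsed result is the same whether or not the old content was loaded.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a parsing fetch (workflow 2); not Nutch or Gora code.
class ParsingFetchSketch {

    // Hypothetical stand-in for a WebPage row with just two fields.
    static class Page {
        String content;
        String text;
    }

    // Runs one fetch-and-parse cycle for a URL against a map-backed "DB".
    // loadStoredContent models whether CONTENT is read at start-up.
    static Page fetchAndParse(Map<String, String> db, String url,
                              String fetched, boolean loadStoredContent) {
        Page page = new Page();
        if (loadStoredContent) {
            page.content = db.get(url);   // old content read from the DB
        }
        page.content = fetched;           // replaced by the fresh HTTP response
        page.text = page.content.trim();  // stand-in for the real parser
        db.put(url, page.content);        // fetcher stores the new content
        return page;
    }

    public static void main(String[] args) {
        Map<String, String> db = new HashMap<>();
        db.put("http://example.org/", "old content");
        Page withLoad = fetchAndParse(
                new HashMap<>(db), "http://example.org/", " new content ", true);
        Page withoutLoad = fetchAndParse(
                new HashMap<>(db), "http://example.org/", " new content ", false);
        System.out.println(withLoad.text.equals(withoutLoad.text)); // prints "true"
    }
}
```

The sketch only covers the content field. Whether every other ParserJob field is also rewritten during the fetch is exactly the open question in this thread.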



[jira] [Commented] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up

2013-03-28 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616250#comment-13616250
 ] 

lufeng commented on NUTCH-1538:
---

Yes. However, we cannot guarantee that plugins written by users will not 
use the corresponding field values in the WebPage class. 



[jira] [Commented] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up

2013-03-27 Thread Roland von Herget (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615063#comment-13615063
 ] 

Roland von Herget commented on NUTCH-1538:
--

Hi lufeng,

After reading a bit more of the Nutch code, the question arises whether it is 
really necessary to load any of these ParserJob.FIELDS.
Shouldn't the fetcher set up all fields (all of "fit.page") necessary for the 
parser during the fetch?
I think I will give this a try here.




[jira] [Commented] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up

2013-03-05 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593223#comment-13593223
 ] 

lufeng commented on NUTCH-1538:
---

Hi Roland,

Maybe we can add a QueryFieldFilter to remove fields that are never used in 
fetch when the fetcher.parse property is true.
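
A rough sketch of what such a filter could look like. `QueryFieldFilter` is only a name proposed in this comment; the class below, its method and the `Field` enum are hypothetical and are not part of Nutch or Gora.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the proposed filter; not an existing Nutch class.
class QueryFieldFilterSketch {

    // Stand-in for WebPage.Field; the names are illustrative only.
    enum Field { STATUS, FETCH_TIME, CONTENT, CONTENT_TYPE, TEXT }

    // Returns the requested fields minus those known to be unused during
    // fetch. Neither input set is modified.
    static Set<Field> filter(Set<Field> requested, Set<Field> neverUsedInFetch) {
        Set<Field> result = EnumSet.noneOf(Field.class);
        result.addAll(requested);
        result.removeAll(neverUsedInFetch);
        return result;
    }

    public static void main(String[] args) {
        Set<Field> requested = EnumSet.of(Field.STATUS, Field.CONTENT, Field.TEXT);
        Set<Field> unused = EnumSet.of(Field.CONTENT);
        System.out.println(filter(requested, unused));
    }
}
```

Which fields belong in the "never used" set would have to be decided per deployment, since user plugins may read fields the core fetcher does not.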



[jira] [Commented] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up

2013-03-04 Thread Roland (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593150#comment-13593150
 ] 

Roland commented on NUTCH-1538:
---

No, this only happens when fetcher.parse=true.
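{code}
  public Collection<WebPage.Field> getFields(Job job) {
    Collection<WebPage.Field> fields = new HashSet<WebPage.Field>(FIELDS);
    // Only a parsing fetcher (fetcher.parse=true) also requests the parser's fields.
    if (job.getConfiguration().getBoolean(PARSE_KEY, false)) {
      ParserJob parserJob = new ParserJob();
      fields.addAll(parserJob.getFields(job));
    }
  [...]
{code}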
A 'normal' fetcher should not be affected.

It would be a big improvement to get some kind of more granular control over 
which columns to load. (But it's an improvement, not a bug, I think.)
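
The behaviour of the snippet above can be modelled in isolation. This is a minimal sketch, assuming made-up field sets: the `Field` enum and the split between `FETCHER_FIELDS` and `PARSER_FIELDS` are hypothetical and do not reflect the real constants in FetcherJob or ParserJob.

```java
import java.util.EnumSet;
import java.util.Set;

// Self-contained model of the field selection; not the real FetcherJob.
class FetcherFieldsSketch {

    // Stand-in for WebPage.Field; the names are illustrative only.
    enum Field { STATUS, FETCH_TIME, CONTENT, CONTENT_TYPE, SIGNATURE }

    // Assumed field sets, chosen for illustration.
    static final Set<Field> FETCHER_FIELDS = EnumSet.of(Field.STATUS, Field.FETCH_TIME);
    static final Set<Field> PARSER_FIELDS =
            EnumSet.of(Field.CONTENT, Field.CONTENT_TYPE, Field.SIGNATURE);

    // Mirrors the quoted logic: parser fields are added only when parsing.
    static Set<Field> getFields(boolean parse) {
        Set<Field> fields = EnumSet.copyOf(FETCHER_FIELDS);
        if (parse) {
            fields.addAll(PARSER_FIELDS);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println("parse=false: " + getFields(false));
        System.out.println("parse=true:  " + getFields(true));
    }
}
```

With `parse` set to false the content column is never requested, which matches the statement that a 'normal' fetcher is not affected.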




[jira] [Commented] (NUTCH-1538) tuning of loaded fields during fetcherJob start-up

2013-03-04 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592524#comment-13592524
 ] 

Lewis John McGibbney commented on NUTCH-1538:
-

Thank god you got to the bottom of this one, Roland. 
I never use a parsing fetcher.
Just to clarify, are you stating that the fields which result in slow loading 
are always loaded, regardless of whether a parsing fetcher is used or not?
If this is not the case then no patch need be applied; however, it is certainly 
something folks need to be aware of IF they choose to use a parsing fetcher 
and to store content.
This one had me stumped.
