Re: nutch with cassandra internal network usage
Just for reference in the archives:
https://issues.apache.org/jira/browse/NUTCH-1538

--Roland

On 04.03.2013 11:05, Julien Nioche wrote:

> Hi Roland,
>
> Can you please open a JIRA for this? Thanks for investigating, the
> explanation makes a lot of sense.
>
> Julien
Re: nutch with cassandra internal network usage
Hi Roland,

Can you please open a JIRA for this? Thanks for investigating, the explanation makes a lot of sense.

Julien

On 4 March 2013 07:26, Roland wrote:

> Hi all,
>
> I've read the sources ;)
> (no, not really all, but enough, I hope)
>
> So, the major difference between generator & fetcher is the set of fields
> each one loads from the db.
> As I had fetcher.store.content=true in the beginning, there was a lot of
> data in the content fields.
> I run with fetcher.parse=true and that's why it loads all content during
> start-up of fetcherJob.
>
> [...]
Re: nutch with cassandra internal network usage
Hi all,

I've read the sources ;)
(no, not really all, but enough, I hope)

So, the major difference between generator & fetcher is the set of fields each one loads from the db.
As I had fetcher.store.content=true in the beginning, there was a lot of data in the content fields.
I run with fetcher.parse=true, and that's why it loads all content during start-up of fetcherJob.

I did this in my local 2.1 sources:

Index: src/java/org/apache/nutch/fetcher/FetcherJob.java
===================================================================
--- src/java/org/apache/nutch/fetcher/FetcherJob.java (revision 1448112)
+++ src/java/org/apache/nutch/fetcher/FetcherJob.java (working copy)
@@ -140,6 +140,8 @@
     if (job.getConfiguration().getBoolean(PARSE_KEY, false)) {
       ParserJob parserJob = new ParserJob();
       fields.addAll(parserJob.getFields(job));
+      fields.remove(WebPage.Field.CONTENT); // FIXME
+      fields.remove(WebPage.Field.OUTLINKS); // FIXME
     }
     ProtocolFactory protocolFactory = new ProtocolFactory(job.getConfiguration());
     fields.addAll(protocolFactory.getFields());

and now the start-up time of a fetcherJob is about 10 minutes :)

--Roland

On 22.02.2013 10:28, Roland wrote:

> Hi Julien,
>
> ok, so thanks for the clarification, I think I have to read the sources :)
>
> --Roland
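The effect of the patch on the set of fields requested from the store can be sketched roughly as follows. This is only an illustration under stated assumptions: `PageField` is a hypothetical stand-in for `WebPage.Field`, and the parser and protocol field lists are made up; only the names CONTENT and OUTLINKS and the order of the `addAll`/`remove` calls come from the diff.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for WebPage.Field; only CONTENT and OUTLINKS are named in the patch.
enum PageField { BASE_URL, STATUS, MARKERS, CONTENT, OUTLINKS, TEXT }

class FetcherFieldsSketch {
  // Assumed field lists for the parser and the protocol plugins (illustrative only).
  static final List<PageField> PARSER_FIELDS =
      Arrays.asList(PageField.STATUS, PageField.CONTENT, PageField.OUTLINKS, PageField.TEXT);
  static final List<PageField> PROTOCOL_FIELDS = Arrays.asList(PageField.BASE_URL);

  // Builds the set of fields to load, following the order of calls in the diff.
  static Set<PageField> fieldsToLoad(boolean parse, boolean dropHeavyFields) {
    Set<PageField> fields = EnumSet.of(PageField.MARKERS);
    if (parse) {
      fields.addAll(PARSER_FIELDS);
      if (dropHeavyFields) {
        // The two lines added by the patch: do not read stored content and outlinks.
        fields.remove(PageField.CONTENT);
        fields.remove(PageField.OUTLINKS);
      }
    }
    fields.addAll(PROTOCOL_FIELDS);
    return fields;
  }

  public static void main(String[] args) {
    System.out.println("unpatched: " + fieldsToLoad(true, false));
    System.out.println("patched:   " + fieldsToLoad(true, true));
  }
}
```

Note that the two removals are marked FIXME in the patch, so this reads as a local workaround rather than a finished fix; the JIRA issue linked at the top of the thread is the place to follow up.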
Re: nutch with cassandra internal network usage
Hi Julien,

ok, so thanks for the clarification, I think I have to read the sources :)

--Roland

On 22.02.2013 10:10, Julien Nioche wrote:

> Hi Roland
>
> My previous email should have started with "The point Alex is making is
> ..." and not just "The point is ...".
> I don't have an explanation as to why the generator is faster than the
> fetcher as I don't use 2.x at all, but it would definitely be interesting
> to find out. The behaviour of the fetcher is how I expect GORA to behave
> in its current form, i.e. pull everything - filter - process.
>
> Julien
Re: nutch with cassandra internal network usage
Hi Lewis,

ok, first a few words about hardware: nutch is running on a 16 core AMD Opteron 2GHz. Cassandra is on an 8 core Intel Xeon 3.3 GHz. Both have 128GB RAM and are connected via a GBit network.

Here is the timing of a generation job (after injecting 228007 urls):

time ./bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1361519572-1552351269

real 16m26.089s
user 3m3.303s
sys 0m43.123s

The fetcher job for this ID has now been running for 70 min and has used 63 min of CPU time. The load of both servers is <1.0, but network traffic is somewhere at 180-200MBit/s as described before.
Both servers are responding fine and doing a few other jobs without problems.

The terminal shows this:

VM Started:
FetcherJob: starting
FetcherJob: batchId: 1361519572-1552351269
FetcherJob: threads: 30
FetcherJob: parsing: true
FetcherJob: resuming: true
FetcherJob : timelimit set for : -1

--Roland

On 22.02.2013 09:39, Lewis John Mcgibbney wrote:

> Roland,
>
> I am curious to know exactly what is happening between the fetcherJob
> initiation and the actual fetch of the first URL. Does the terminal just
> hang? Can you track some metrics of the job?
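To get a feeling for the figures Roland reports, here is a back-of-the-envelope estimate of how much data would have crossed the network. It assumes the 180-200 MBit/s rate was sustained for the whole 70 minutes, which the message does not state exactly, and it uses decimal units (1 GB = 1000 MB). The class and method names are made up for the illustration.

```java
class TransferEstimate {
  // Gigabytes moved at a steady rate: Mbit/s * seconds / 8 bits per byte / 1000 MB per GB.
  static double gigabytes(double mbitPerSecond, double minutes) {
    return mbitPerSecond * minutes * 60.0 / 8.0 / 1000.0;
  }

  public static void main(String[] args) {
    // Lower and upper bound of the reported traffic over the reported run time.
    System.out.printf("at 180 MBit/s: %.1f GB%n", gigabytes(180, 70));
    System.out.printf("at 200 MBit/s: %.1f GB%n", gigabytes(200, 70));
  }
}
```

Under those assumptions the job would have pulled somewhere around 95 to 105 GB before fetching a single URL, which is at least consistent with a large amount of data being read from the store at start-up.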
Re: nutch with cassandra internal network usage
Hi Roland

My previous email should have started with "The point Alex is making is ..." and not just "The point is ...".
I don't have an explanation as to why the generator is faster than the fetcher as I don't use 2.x at all, but it would definitely be interesting to find out. The behaviour of the fetcher is how I expect GORA to behave in its current form, i.e. pull everything - filter - process.

Julien

On 21 February 2013 16:58, Roland wrote:

> Hi Julien,
>
> the point I personally don't get is: why is generating fast but fetching
> not?
> If it's possible to filter the generatorJob at the backend (which I think
> it does), shouldn't it be possible to do the same for the fetcher?
>
> --Roland

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: nutch with cassandra internal network usage
Roland,

I am curious to know exactly what is happening between the fetcherJob initiation and the actual fetch of the first URL. Does the terminal just hang? Can you track some metrics of the job?

On Thursday, February 21, 2013, Roland wrote:

> Hi Julien,
>
> the point I personally don't get is: why is generating fast but fetching
> not?
> If it's possible to filter the generatorJob at the backend (which I think
> it does), shouldn't it be possible to do the same for the fetcher?
>
> --Roland

--
*Lewis*
Re: nutch with cassandra internal network usage
Hi Julien,

the point I personally don't get is: why is generating fast but fetching not?
If it's possible to filter the generatorJob at the backend (which I think it does), shouldn't it be possible to do the same for the fetcher?

--Roland

On 21.02.2013 12:27, Julien Nioche wrote:

> Lewis,
>
> The point is whether the filtering is done on the backend side (e.g. using
> queries, indices, etc.) with the results then passed on to MapReduce via
> GORA or, as I assume from looking at the code, done within the MapReduce
> job, which means that all the entries are pulled from the backend anyway.
> This makes quite a difference in terms of performance: think e.g. of a
> large webtable which would have to be passed to MapReduce in its entirety
> even if only a handful of entries are to be processed.
>
> Makes sense?
>
> Julien
Re: nutch with cassandra internal network usage
I get it fine. I do think it is important to discuss the current filtering code in the generator though. Yeah, okay, it turns out that our current implementation (which reads all entries and then does the filtering on the Nutch side) can be horribly expensive, but at least there is some mechanism in place, right?

We will work on the scan over in Gora after the 0.3 release. Null unions in Avro schemas, e.g. GORA-174, have been kicking our heads in, but we are getting there.

As always, anyone interested in contributing to the cause, please shoot over to user@gora

Thanks
Lewis

On Thursday, February 21, 2013, Julien Nioche wrote:

> Lewis,
>
> The point is whether the filtering is done on the backend side (e.g. using
> queries, indices, etc.) with the results then passed on to MapReduce via
> GORA or, as I assume from looking at the code, done within the MapReduce
> job, which means that all the entries are pulled from the backend anyway.
> This makes quite a difference in terms of performance: think e.g. of a
> large webtable which would have to be passed to MapReduce in its entirety
> even if only a handful of entries are to be processed.
>
> Makes sense?
>
> Julien

--
*Lewis*
Re: nutch with cassandra internal network usage
Lewis,

The point is whether the filtering is done on the backend side (e.g. using queries, indices, etc.) with only the matching entries then passed on to MapReduce via GORA, or, as I assume from looking at the code, the filtering happens within MapReduce, which means that all the entries are pulled from the backend anyway. This makes quite a difference in terms of performance: think, for example, of a large webtable which would have to be passed to MapReduce in its entirety even if only a handful of entries are to be processed.

Makes sense?

Julien

On 21 February 2013 01:52, Lewis John Mcgibbney wrote:
> Those filters are applied only to URLs which have a null GENERATE_MARK,
> i.e. URLs which have not already been generated. Everything else is
> skipped, e.g.
>
>     if (Mark.GENERATE_MARK.checkMark(page) != null) {
>       if (GeneratorJob.LOG.isDebugEnabled()) {
>         GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>       }
>       return;
>     }
>
> Therefore filters will be applied to all URLs which have a null
> GENERATE_MARK value.

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
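To make the difference between the two placements of the filter concrete, here is a minimal toy sketch in Java. It is not Nutch or Gora code: `Store`, `Row`, `scanAll`, `scanWhere` and the other names are hypothetical, invented for illustration only, and the "backend" is just an in-memory list. It only counts how many rows leave the store when the batchId check runs in the job (pull everything, then filter) versus in the store itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model (not Nutch/Gora code): counts rows that leave the store under each strategy.
class FilterPlacementSketch {

    // Hypothetical row: a URL and the batch id assigned to it (null if none).
    static final class Row {
        final String url;
        final String batchId;
        Row(String url, String batchId) { this.url = url; this.batchId = batchId; }
    }

    // Hypothetical in-memory stand-in for the backend table.
    static final class Store {
        final List<Row> rows = new ArrayList<>();
        long rowsSent = 0; // rows handed to the caller, i.e. the "network traffic"

        // Pull everything: every row is sent, the caller filters afterwards.
        List<Row> scanAll() {
            rowsSent += rows.size();
            return new ArrayList<>(rows);
        }

        // Backend-side filtering: only rows matching the predicate are sent.
        List<Row> scanWhere(Predicate<Row> predicate) {
            List<Row> matches = new ArrayList<>();
            for (Row row : rows) {
                if (predicate.test(row)) {
                    matches.add(row);
                }
            }
            rowsSent += matches.size();
            return matches;
        }
    }

    // Filter inside the job, after all rows have been transferred.
    static List<Row> selectClientSide(Store store, String batchId) {
        List<Row> selected = new ArrayList<>();
        for (Row row : store.scanAll()) {
            if (batchId.equals(row.batchId)) {
                selected.add(row);
            }
        }
        return selected;
    }

    // Filter in the store, before anything is transferred.
    static List<Row> selectBackendSide(Store store, String batchId) {
        return store.scanWhere(row -> batchId.equals(row.batchId));
    }

    // Builds a store with `total` rows, the first `inBatch` of which carry `batchId`.
    static Store sampleStore(int total, int inBatch, String batchId) {
        Store store = new Store();
        for (int i = 0; i < total; i++) {
            store.rows.add(new Row("http://example.org/" + i, i < inBatch ? batchId : null));
        }
        return store;
    }

    public static void main(String[] args) {
        Store a = sampleStore(100_000, 50, "batch-1");
        Store b = sampleStore(100_000, 50, "batch-1");
        System.out.println("client side:  selected " + selectClientSide(a, "batch-1").size()
                + ", rows sent " + a.rowsSent);
        System.out.println("backend side: selected " + selectBackendSide(b, "batch-1").size()
                + ", rows sent " + b.rowsSent);
    }
}
```

Both strategies select the same rows; only the number of rows transferred differs, which is the cost being discussed here. Whether a given backend can actually evaluate such a predicate itself is a separate question that this sketch does not answer.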
Re: nutch with cassandra internal network usage
Those filters are applied only to URLs which have a null GENERATE_MARK, i.e. URLs which have not already been generated. Everything else is skipped, e.g.

    // Skip pages which already carry a generate mark.
    if (Mark.GENERATE_MARK.checkMark(page) != null) {
      if (GeneratorJob.LOG.isDebugEnabled()) {
        GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
      }
      return;
    }

Therefore filters will be applied to all URLs which have a null GENERATE_MARK value.

On Wed, Feb 20, 2013 at 2:45 PM, wrote:
> Hi,
>
> Are those filters put on all data selected from hbase or sent to hbase as
> filters to select a subset of all hbase records?
>
> Thanks.
> Alex.

--
Lewis
Re: nutch with cassandra internal network usage
Hi,

Are those filters applied to all the data selected from HBase, or are they sent to HBase as filters so that only a subset of the HBase records is selected?

Thanks.
Alex.

-----Original Message-----
From: Lewis John Mcgibbney
To: user
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage

Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM, wrote:
> The generator also does not have filters. Its mapper goes over all
> records as far as I know. If you use hadoop you can see how many records go
> as input to mappers. Also see this

I don't think this is true. The GeneratorMapper filters URLs before selecting them for inclusion based on the following
- distance
- URLNormalizer(s)
- URLFilter(s)
in that order.
I am going to start a new thread on improvements to the GeneratorJob regarding better configuration as it is a crucial stage in the crawl process.

So the issue here, as you correctly explain, is with the Fetcher obtaining the URLs which have been marked with a desired batchId. This would be done via scanner in Gora.
Re: nutch with cassandra internal network usage
Hi,

Please head over to the most recent thread on dev@ for potential improvements to the Generator* code.
Thanks for invoking this discussion, it is well overdue.

Lewis

On Wed, Feb 20, 2013 at 12:55 PM, Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> wrote:
> Hi Alex,
>
> On Wed, Feb 20, 2013 at 11:54 AM, wrote:
>> The generator also does not have filters. Its mapper goes over all
>> records as far as I know. If you use hadoop you can see how many records go
>> as input to mappers. Also see this
>
> I don't think this is true. The GeneratorMapper filters URLs before
> selecting them for inclusion based on the following
> - distance
> - URLNormalizer(s)
> - URLFilter(s)
> in that order.
> I am going to start a new thread on improvements to the GeneratorJob
> regarding better configuration as it is a crucial stage in the crawl
> process.
>
> So the issue here, as you correctly explain, is with the Fetcher obtaining
> the URLs which have been marked with a desired batchId. This would be done
> via scanner in Gora.

--
Lewis
Re: nutch with cassandra internal network usage
Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM, wrote:
> The generator also does not have filters. Its mapper goes over all
> records as far as I know. If you use hadoop you can see how many records go
> as input to mappers. Also see this

I don't think this is true. The GeneratorMapper filters URLs before selecting them for inclusion, based on the following, in this order:
- distance
- URLNormalizer(s)
- URLFilter(s)

I am going to start a new thread on improvements to the GeneratorJob regarding better configuration, as it is a crucial stage in the crawl process.

So the issue here, as you correctly explain, is with the Fetcher obtaining the URLs which have been marked with a desired batchId. This would be done via a scanner in Gora.
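The selection order described in this message (distance, then normalizers, then filters) can be sketched as a small self-contained Java example. This is a toy model, not the real GeneratorMapper: `Page`, `select`, `maxDistance` and the two lambdas are hypothetical names invented for illustration, the convention that a negative `maxDistance` disables the distance check is an assumption of this sketch, and the initial skip of already generated pages is added as an assumption about where such a check would sit.

```java
import java.util.Locale;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Toy model of the selection order described above (not the real GeneratorMapper).
class GeneratorFilterSketch {

    // Hypothetical page: only the two fields this sketch needs.
    static final class Page {
        final String generateMark; // null means "not generated yet"
        final int distance;        // link distance from the injected seeds
        Page(String generateMark, int distance) {
            this.generateMark = generateMark;
            this.distance = distance;
        }
    }

    // Returns the normalised URL if the page is selected, or null if it is skipped.
    static String select(String url, Page page, int maxDistance,
                         UnaryOperator<String> normalizer, Predicate<String> urlFilter) {
        if (page.generateMark != null) {
            return null; // already generated: none of the filters below are applied
        }
        if (maxDistance > -1 && page.distance > maxDistance) {
            return null; // 1. distance (assumption: a negative maxDistance disables the check)
        }
        String normalized = normalizer.apply(url); // 2. normaliser(s)
        if (!urlFilter.test(normalized)) {
            return null; // 3. filter(s), applied to the normalised URL
        }
        return normalized;
    }

    public static void main(String[] args) {
        UnaryOperator<String> lowerCase = u -> u.toLowerCase(Locale.ROOT);
        Predicate<String> httpOnly = u -> u.startsWith("http://");

        // Selected: no mark, within distance, passes the filter after normalisation.
        System.out.println(select("HTTP://Example.org/", new Page(null, 1), 3, lowerCase, httpOnly));
        // Skipped: already carries a generate mark.
        System.out.println(select("http://example.org/", new Page("batch-1", 1), 3, lowerCase, httpOnly));
        // Skipped: too far from the seeds.
        System.out.println(select("http://example.org/", new Page(null, 5), 3, lowerCase, httpOnly));
        // Skipped: rejected by the URL filter.
        System.out.println(select("ftp://example.org/", new Page(null, 1), 3, lowerCase, httpOnly));
    }
}
```

Note that every check here runs on a record the mapper has already received, so this kind of filtering reduces the work done after the read, not the amount of data read from the store.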
Re: nutch with cassandra internal network usage
The generator also does not have filters. As far as I know, its mapper goes over all records. If you use Hadoop, you can see how many records go as input to the mappers. Also see https://issues.apache.org/jira/browse/GORA-119

Alex.

-----Original Message-----
From: Roland
To: user
Sent: Wed, Feb 20, 2013 11:47 am
Subject: Re: nutch with cassandra internal network usage

Hi Alex,

the GeneratorJob seems to have a solution for that; if not, it would iterate over all records too, am I right?

--Roland
Re: nutch with cassandra internal network usage
Hi Alex,

the GeneratorJob seems to have a solution for that; if not, it would iterate over all records too, am I right?

--Roland

On 20.02.2013 20:42, alx...@aim.com wrote:
> Hi,
>
> This is because fetch's mapper goes over all records and selects those that has the given batchId. Currently mappers of all nutch commands does not have filters.
> It is interesting to know if you can selects records with a given batchId in cassandra without iterating over all records.
>
> Alex.
Re: nutch with cassandra internal network usage
Hi,

This is because fetch's mapper goes over all records and selects those that have the given batchId. Currently the mappers of all nutch commands do not have filters.
It would be interesting to know whether you can select records with a given batchId in cassandra without iterating over all records.

Alex.

-----Original Message-----
From: Roland
To: user
Sent: Wed, Feb 20, 2013 10:56 am
Subject: Re: nutch with cassandra internal network usage

Hi Lewis,

the GeneratorJob takes only ~5 minutes.
I'm running it in standalone mode, like this:
./bin/nutch fetch 1361367698-1708119958 -threads 40

It's configured to fetch & parse, but it makes no difference if it only fetches:
FetcherJob: starting
FetcherJob: batchId: 1361367698-1708119958
FetcherJob: threads: 40
FetcherJob: parsing: true
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1

--Roland
Re: nutch with cassandra internal network usage
I am assuming that your generate.max.count property is set to the default value of -1? Have you tried configuring more, smaller batchIds (fetch lists)?
I don't have an immediate answer as to why, overall, the FetcherJob is taking this amount of time and resources.

On Wednesday, February 20, 2013, Roland wrote:
> Hi Lewis,
>
> the GeneratorJob takes only ~5 minutes.
> I'm running it in standalone mode, like this:
> ./bin/nutch fetch 1361367698-1708119958 -threads 40
>
> It's configured to fetch & parse, but it makes no difference if it only fetches:
> FetcherJob: starting
> FetcherJob: batchId: 1361367698-1708119958
> FetcherJob: threads: 40
> FetcherJob: parsing: true
> FetcherJob: resuming: false
> FetcherJob : timelimit set for : -1
>
> --Roland

--
Lewis
Re: nutch with cassandra internal network usage
Hi Lewis,

the GeneratorJob takes only ~5 minutes.
I'm running the fetch in standalone mode, like this:
./bin/nutch fetch 1361367698-1708119958 -threads 40

It's configured to fetch & parse, but it makes no difference if it only fetches:
FetcherJob: starting
FetcherJob: batchId: 1361367698-1708119958
FetcherJob: threads: 40
FetcherJob: parsing: true
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1

--Roland

On 20.02.2013 19:44, Lewis John Mcgibbney wrote:
> Hi Roland,
>
> You say you start a fetch run, does this mean the FetcherJob or
> GeneratorJob? What kind of settings do you run your zNutch server with?
Re: nutch with cassandra internal network usage
Hi Roland,

You say you start a fetch run; does this mean the FetcherJob or the GeneratorJob? What kind of settings do you run your Nutch server with?

On Wednesday, February 20, 2013, Roland wrote:
> Hi list,
>
> we're experimenting with nutch 2.1 and cassandra 1.2.1 (on different hosts).
> Our cassandra 'webpage' store has about 31GB right now on disk, we add URLs by 'injecting' them, about 100k-300k per cycle.
> When starting a 'fetch' run, it now needs about an hour before the queues are set up / the first page is fetched.
> During this time we can see about 180MBit/s network traffic from the cassandra host to the nutch host (outgoing of cassandra).
> If I calculate the transferred data during this time (taking only 150MBit/s into account):
> 150MBit/s*1000*1000/8/1024/1024/1024*3600sec ~= 62GB
>
> So, why does nutch load all data from the db, and not only the relevant data of this fetch? And why does it happen twice?
>
> Thanks,
> Roland

--
Lewis
nutch with cassandra internal network usage
Hi list,

we're experimenting with nutch 2.1 and cassandra 1.2.1 (on different hosts).
Our cassandra 'webpage' store currently takes about 31GB on disk; we add URLs by 'injecting' them, about 100k-300k per cycle.
When starting a 'fetch' run, it now takes about an hour before the queues are set up / the first page is fetched.
During this time we can see about 180MBit/s of network traffic from the cassandra host to the nutch host (outgoing from cassandra).
If I calculate the data transferred during this time (taking only 150MBit/s into account):
150MBit/s*1000*1000/8/1024/1024/1024*3600sec ~= 62GB

So, why does nutch load all data from the db, and not only the data relevant to this fetch? And why does it happen twice (about 62GB transferred for a 31GB store)?

Thanks,
Roland
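The estimate in this message can be checked with a few lines of Java. This is only a worked version of the arithmetic quoted above; the class and method names are made up for illustration, and the inputs are the figures from the message (150 MBit/s sustained for one hour, a 31GB store).

```java
// Worked version of the transfer estimate: data moved at a sustained rate over a period.
class TransferEstimate {

    // Converts a rate in MBit/s (decimal mega, as in the message) and a duration
    // in seconds into GiB (1024^3 bytes), mirroring the formula in the message.
    static double transferredGiB(double megabitsPerSecond, long seconds) {
        double bytesPerSecond = megabitsPerSecond * 1000 * 1000 / 8;
        return bytesPerSecond * seconds / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        double gib = transferredGiB(150, 3600); // conservative 150 MBit/s for one hour
        double storeGb = 31;                    // reported on-disk size of the 'webpage' store
        System.out.printf("transferred: %.1f GiB%n", gib);
        System.out.printf("ratio to store size: %.2f%n", gib / storeGb);
    }
}
```

The result is a little under 63 GiB, roughly twice the reported store size, which is where the "twice" in the question comes from. On-disk size and bytes on the wire are not directly comparable (compression and protocol overhead both play a role), so the factor of two should be read as a rough indication only.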