Re: nutch with cassandra internal network usage

2013-03-04 Thread Roland

just for reference in the archives:
https://issues.apache.org/jira/browse/NUTCH-1538

--Roland

On 04.03.2013 11:05, Julien Nioche wrote:

Hi Roland,

Can you please open a JIRA for this? Thanks for investigating, the
explanation makes a lot of sense

Julien

On 4 March 2013 07:26, Roland  wrote:


Hi all,

I've read the sources ;)
(no, not really all, but enough, I hope)

So, the major difference between the generator and the fetcher is the set of
fields each one loads from the db.
As I had fetcher.store.content=true in the beginning, there was a lot of data
in the content fields.
I run with fetcher.parse=true, and that's why it loads all the content during
start-up of the fetcherJob.

I did this in my local 2.1 sources:
Index: src/java/org/apache/nutch/fetcher/FetcherJob.java
===================================================================
--- src/java/org/apache/nutch/fetcher/FetcherJob.java   (revision 1448112)
+++ src/java/org/apache/nutch/fetcher/FetcherJob.java   (working copy)
@@ -140,6 +140,8 @@
 if (job.getConfiguration().getBoolean(PARSE_KEY, false)) {
   ParserJob parserJob = new ParserJob();
   fields.addAll(parserJob.getFields(job));
+  fields.remove(WebPage.Field.CONTENT); // FIXME
+  fields.remove(WebPage.Field.OUTLINKS); // FIXME
 }
 ProtocolFactory protocolFactory = new ProtocolFactory(job.getConfiguration());
 fields.addAll(protocolFactory.getFields());

and now the start-up time of a fetcherJob is about 10 minutes :)

--Roland


On 22.02.2013 10:28, Roland wrote:

  Hi Julien,

ok, so thanks for the clarification, I think I have to read the sources :)

--Roland

On 22.02.2013 10:10, Julien Nioche wrote:

Hi Roland

My previous email should have started with "The point Alex is making is ..."
and not just "The point is ...".
I don't have an explanation as to why the generator is faster than the
fetching as I don't use 2.x at all but it would definitely be interesting
to find out. The behaviour of the fetcher is how I expect GORA to behave in
its current form i.e. pull everything - filter - process.

Julien


On 21 February 2013 16:58, Roland  wrote:

  Hi Julien,

the point I personally don't get, is: why is generating fast - fetching
not.
If it's possible to filter the generatorJob at the backend (what I think
it does), shouldn't it be possible to do the same for the fetcher?

--Roland

On 21.02.2013 12:27, Julien Nioche wrote:

Lewis,

The point is whether the filtering is done on the backend side (e.g. using
queries, indices, etc...) then passed on to MapReduce via GORA or as I
assume by looking at the code filtered within the MapReduce which means
that all the entries are pulled from the backend anyway.
This makes quite a difference in terms of performance if you think e.g
about a large webtable which would have to be entirely passed to mapreduce
even if only a handful of entries are to be processed.

Makes sense?

Julien


On 21 February 2013 01:52, Lewis John Mcgibbney wrote:

Those filters are applied only to URLs which do not have a null
GENERATE_MARK
e.g.

  if (Mark.GENERATE_MARK.checkMark(page) != null) {
    if (GeneratorJob.LOG.isDebugEnabled()) {
      GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
    }
    return;

Therefore filters will be applied to all URLs which have a null
GENERATE_MARK value.

On Wed, Feb 20, 2013 at 2:45 PM,  wrote:

Hi,

Are those filters put on all data selected from hbase or sent to hbase as
filters to select a subset of all hbase records?

Thanks.
Alex.







-Original Message-
From: Lewis John Mcgibbney 
To: user 
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage


Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM,  wrote:

> The generator also does not have filters. Its mapper goes over all
> records as far as I know. If you use hadoop you can see how many records go
> as input to mappers. Also see this

I don't think this is true. The GeneratorMapper filters URLs before
selecting them for inclusion based on the following
- distance
- URLNormalizer(s)
- URLFilter(s)
in that order.
I am going to start a new thread on improvements to the GeneratorJob
regarding better configuration as it is a crucial stage in the crawl
process.

So the issue here, as you correctly explain, is with the Fetcher obtaining
the URLs which have been marked with a desired batchId. This would be done
via scanner in Gora.




  --

*Lewis*









Re: nutch with cassandra internal network usage

2013-03-04 Thread Julien Nioche
Hi Roland,

Can you please open a JIRA for this? Thanks for investigating, the
explanation makes a lot of sense

Julien

On 4 March 2013 07:26, Roland  wrote:

> Hi all,
>
> I've read the sources ;)
> (no, not really all, but enough, I hope)
>
> So, the major difference between the generator and the fetcher is the set of
> fields each one loads from the db.
> As I had fetcher.store.content=true in the beginning, there was a lot of data
> in the content fields.
> I run with fetcher.parse=true, and that's why it loads all the content during
> start-up of the fetcherJob.
>
> I did this in my local 2.1 sources:
> Index: src/java/org/apache/nutch/fetcher/FetcherJob.java
> ===================================================================
> --- src/java/org/apache/nutch/fetcher/FetcherJob.java   (revision 1448112)
> +++ src/java/org/apache/nutch/fetcher/FetcherJob.java   (working copy)
> @@ -140,6 +140,8 @@
>  if (job.getConfiguration().getBoolean(PARSE_KEY, false)) {
>    ParserJob parserJob = new ParserJob();
>    fields.addAll(parserJob.getFields(job));
> +  fields.remove(WebPage.Field.CONTENT); // FIXME
> +  fields.remove(WebPage.Field.OUTLINKS); // FIXME
>  }
>  ProtocolFactory protocolFactory = new ProtocolFactory(job.getConfiguration());
>  fields.addAll(protocolFactory.getFields());
>
> and now the start-up time of a fetcherJob is about 10 minutes :)
>
> --Roland
>
>
> On 22.02.2013 10:28, Roland wrote:
>
>  Hi Julien,
>>
>> ok, so thanks for the clarification, I think I have to read the sources :)
>>
>> --Roland
>>
>> On 22.02.2013 10:10, Julien Nioche wrote:
>>
>>> Hi Roland
>>>
>>> My previous email should have started with "The point Alex is making is ..."
>>> and not just "The point is ...".
>>> I don't have an explanation as to why the generator is faster than the
>>> fetching as I don't use 2.x at all but it would definitely be interesting
>>> to find out. The behaviour of the fetcher is how I expect GORA to behave in
>>> its current form i.e. pull everything - filter - process.
>>>
>>> Julien
>>>
>>>
>>> On 21 February 2013 16:58, Roland  wrote:
>>>
>>>  Hi Julien,
>>>>
>>>> the point I personally don't get, is: why is generating fast - fetching
>>>> not.
>>>> If it's possible to filter the generatorJob at the backend (what I think
>>>> it does), shouldn't it be possible to do the same for the fetcher?
>>>>
>>>> --Roland
>>>>
>>>> On 21.02.2013 12:27, Julien Nioche wrote:
>>>>
>>>>> Lewis,
>>>>>
>>>>> The point is whether the filtering is done on the backend side (e.g. using
>>>>> queries, indices, etc...) then passed on to MapReduce via GORA or as I
>>>>> assume by looking at the code filtered within the MapReduce which means
>>>>> that all the entries are pulled from the backend anyway.
>>>>> This makes quite a difference in terms of performance if you think e.g
>>>>> about a large webtable which would have to be entirely passed to mapreduce
>>>>> even if only a handful of entries are to be processed.
>>>>>
>>>>> Makes sense?
>>>>>
>>>>> Julien
>>>>>
>>>>>
>>>>> On 21 February 2013 01:52, Lewis John Mcgibbney wrote:
>>>>>
>>>>>> Those filters are applied only to URLs which do not have a null
>>>>>> GENERATE_MARK
>>>>>> e.g.
>>>>>>
>>>>>>   if (Mark.GENERATE_MARK.checkMark(page) != null) {
>>>>>>     if (GeneratorJob.LOG.isDebugEnabled()) {
>>>>>>       GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>>>>>>     }
>>>>>>     return;
>>>>>>
>>>>>> Therefore filters will be applied to all URLs which have a null
>>>>>> GENERATE_MARK value.
>>>>>>
>>>>>> On Wed, Feb 20, 2013 at 2:45 PM,  wrote:
>>>>>>
>>>>>>   Hi,
>>>>>>
>>>>>>> Are those filters put on all data selected from hbase or sent to
>>>>>>> hbase
>

Re: nutch with cassandra internal network usage

2013-03-03 Thread Roland

Hi all,

I've read the sources ;)
(no, not really all, but enough, I hope)

So, the major difference between the generator and the fetcher is the set of
fields each one loads from the db.
As I had fetcher.store.content=true in the beginning, there was a lot of data
in the content fields.
I run with fetcher.parse=true, and that's why it loads all the content during
start-up of the fetcherJob.

I did this in my local 2.1 sources:
Index: src/java/org/apache/nutch/fetcher/FetcherJob.java
===================================================================
--- src/java/org/apache/nutch/fetcher/FetcherJob.java   (revision 1448112)
+++ src/java/org/apache/nutch/fetcher/FetcherJob.java   (working copy)
@@ -140,6 +140,8 @@
 if (job.getConfiguration().getBoolean(PARSE_KEY, false)) {
   ParserJob parserJob = new ParserJob();
   fields.addAll(parserJob.getFields(job));
+  fields.remove(WebPage.Field.CONTENT); // FIXME
+  fields.remove(WebPage.Field.OUTLINKS); // FIXME
 }
 ProtocolFactory protocolFactory = new ProtocolFactory(job.getConfiguration());
 fields.addAll(protocolFactory.getFields());

and now the start-up time of a fetcherJob is about 10 minutes :)
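
The effect of the two removed fields can be sketched in isolation. This is only a toy model, not Nutch code: the nested Field enum stands in for WebPage.Field, and the two base field sets are assumptions chosen for illustration. It shows that once parsing is enabled the parser's fields are merged into the fetcher's set, and that dropping CONTENT and OUTLINKS afterwards keeps the large columns out of what is requested from the store.

```java
import java.util.EnumSet;
import java.util.Set;

class FetcherFieldsSketch {

  // Hypothetical stand-in for WebPage.Field; only a few fields are modelled.
  enum Field { BASE_URL, STATUS, FETCH_TIME, MARKERS, CONTENT, OUTLINKS }

  // Assumed fields the fetcher itself asks for (illustrative, not the real list).
  static final Set<Field> FETCHER_FIELDS =
      EnumSet.of(Field.BASE_URL, Field.STATUS, Field.FETCH_TIME, Field.MARKERS);

  // Assumed fields a parser job asks for (illustrative, not the real list).
  static final Set<Field> PARSER_FIELDS =
      EnumSet.of(Field.STATUS, Field.MARKERS, Field.CONTENT, Field.OUTLINKS);

  // Builds the set of fields requested from the store when the fetcher starts.
  static Set<Field> fieldsToLoad(boolean parse, boolean patched) {
    Set<Field> fields = EnumSet.copyOf(FETCHER_FIELDS);
    if (parse) {
      fields.addAll(PARSER_FIELDS);
      if (patched) {
        // Same idea as the two FIXME lines in the patch above.
        fields.remove(Field.CONTENT);
        fields.remove(Field.OUTLINKS);
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    System.out.println("parse=true, unpatched: " + fieldsToLoad(true, false));
    System.out.println("parse=true, patched:   " + fieldsToLoad(true, true));
  }
}
```

Whether the parser still works correctly without those two fields being loaded is exactly what the FIXME comments leave open.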

--Roland


On 22.02.2013 10:28, Roland wrote:

Hi Julien,

ok, so thanks for the clarification, I think I have to read the sources :)


--Roland

On 22.02.2013 10:10, Julien Nioche wrote:

Hi Roland

My previous email should have started with "The point Alex is making is ..."
and not just "The point is ...".
I don't have an explanation as to why the generator is faster than the
fetching as I don't use 2.x at all but it would definitely be interesting
to find out. The behaviour of the fetcher is how I expect GORA to behave in
its current form i.e. pull everything - filter - process.

Julien


On 21 February 2013 16:58, Roland  wrote:


Hi Julien,

the point I personally don't get, is: why is generating fast - fetching not.
If it's possible to filter the generatorJob at the backend (what I think
it does), shouldn't it be possible to do the same for the fetcher?

--Roland

On 21.02.2013 12:27, Julien Nioche wrote:

Lewis,

The point is whether the filtering is done on the backend side (e.g. using
queries, indices, etc...) then passed on to MapReduce via GORA or as I
assume by looking at the code filtered within the MapReduce which means
that all the entries are pulled from the backend anyway.
This makes quite a difference in terms of performance if you think e.g
about a large webtable which would have to be entirely passed to mapreduce
even if only a handful of entries are to be processed.

Makes sense?

Julien


On 21 February 2013 01:52, Lewis John Mcgibbney wrote:

Those filters are applied only to URLs which do not have a null
GENERATE_MARK
e.g.

  if (Mark.GENERATE_MARK.checkMark(page) != null) {
    if (GeneratorJob.LOG.isDebugEnabled()) {
      GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
    }
    return;

Therefore filters will be applied to all URLs which have a null
GENERATE_MARK value.

On Wed, Feb 20, 2013 at 2:45 PM,  wrote:

Hi,

Are those filters put on all data selected from hbase or sent to hbase as
filters to select a subset of all hbase records?

Thanks.
Alex.







-Original Message-
From: Lewis John Mcgibbney 
To: user 
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage


Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM,  wrote:

> The generator also does not have filters. Its mapper goes over all
> records as far as I know. If you use hadoop you can see how many records go
> as input to mappers. Also see this

I don't think this is true. The GeneratorMapper filters URLs before
selecting them for inclusion based on the following
- distance
- URLNormalizer(s)
- URLFilter(s)
in that order.
I am going to start a new thread on improvements to the GeneratorJob
regarding better configuration as it is a crucial stage in the crawl
process.

So the issue here, as you correctly explain, is with the Fetcher obtaining
the URLs which have been marked with a desired batchId. This would be done
via scanner in Gora.





--
*Lewis*












Re: nutch with cassandra internal network usage

2013-02-22 Thread Roland

Hi Julien,

ok, so thanks for the clarification, I think I have to read the sources :)

--Roland

On 22.02.2013 10:10, Julien Nioche wrote:

Hi Roland

My previous email should have started with "The point Alex is making is ..."
and not just "The point is ...".
I don't have an explanation as to why the generator is faster than the
fetching as I don't use 2.x at all but it would definitely be interesting
to find out. The behaviour of the fetcher is how I expect GORA to behave in
its current form i.e. pull everything - filter - process.

Julien


On 21 February 2013 16:58, Roland  wrote:


Hi Julien,

the point I personally don't get, is: why is generating fast - fetching
not.
If it's possible to filter the generatorJob at the backend (what I think
it does), shouldn't it be possible to do the same for the fetcher?

--Roland

On 21.02.2013 12:27, Julien Nioche wrote:

  Lewis,

The point is whether the filtering is done on the backend side (e.g. using
queries, indices, etc...) then passed on to MapReduce via GORA or as I
assume by looking at the code filtered within the MapReduce which means
that all the entries are pulled from the backend anyway.
This makes quite a difference in terms of performance if you think e.g
about a large webtable which would have to be entirely passed to mapreduce
even if only a handful of entries are to be processed.

Makes sense?

Julien


On 21 February 2013 01:52, Lewis John Mcgibbney wrote:

Those filters are applied only to URLs which do not have a null
GENERATE_MARK
e.g.

  if (Mark.GENERATE_MARK.checkMark(page) != null) {
    if (GeneratorJob.LOG.isDebugEnabled()) {
      GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
    }
    return;

Therefore filters will be applied to all URLs which have a null
GENERATE_MARK value.

On Wed, Feb 20, 2013 at 2:45 PM,  wrote:

  Hi,

Are those filters put on all data selected from hbase or sent to hbase
as
filters to select a subset of all hbase records?

Thanks.
Alex.







-Original Message-
From: Lewis John Mcgibbney 
To: user 
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage


Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM,  wrote:

> The generator also does not have filters. Its mapper goes over all
> records as far as I know. If you use hadoop you can see how many records go
> as input to mappers. Also see this

I don't think this is true. The GeneratorMapper filters URLs before
selecting them for inclusion based on the following
- distance
- URLNormalizer(s)
- URLFilter(s)
in that order.
I am going to start a new thread on improvements to the GeneratorJob
regarding better configuration as it is a crucial stage in the crawl
process.

So the issue here, as you correctly explain, is with the Fetcher obtaining
the URLs which have been marked with a desired batchId. This would be done
via scanner in Gora.





--
*Lewis*










Re: nutch with cassandra internal network usage

2013-02-22 Thread Roland

Hi Lewis,

ok, first a few words about the hardware: Nutch is running on a 16-core AMD
Opteron at 2 GHz.
Cassandra is on an 8-core Intel Xeon at 3.3 GHz. Both machines have 128 GB RAM
and are connected via a GBit network.


Here is the timing of a generation job (after injecting 228007 urls):
time ./bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1361519572-1552351269

real 16m26.089s
user 3m3.303s
sys 0m43.123s

The fetcher job for this ID has now been running for 70 min and has used 63 min
of CPU time. The load on both servers is <1.0, but network traffic is
somewhere around 180-200 MBit/s, as described before.
Both servers are responsive and are handling a few other jobs without
problems. The terminal shows this:

VM Started: FetcherJob: starting
FetcherJob: batchId: 1361519572-1552351269
FetcherJob: threads: 30
FetcherJob: parsing: true
FetcherJob: resuming: true
FetcherJob : timelimit set for : -1
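
For a sense of scale, the figures above can be turned into a rough data volume. This is a back-of-the-envelope sketch only: it assumes the 180-200 MBit/s rate was sustained for the whole 70 minutes (the numbers above do not confirm that) and uses decimal units (1 GB = 1000 MB).

```java
class TransferEstimate {

  // Gigabytes (decimal) moved at a sustained rate in Mbit/s over the given minutes.
  static double gigabytes(double mbitPerSecond, double minutes) {
    double megabits = mbitPerSecond * minutes * 60.0; // total megabits transferred
    double megabytes = megabits / 8.0;                // 8 bits per byte
    return megabytes / 1000.0;                        // decimal gigabytes
  }

  public static void main(String[] args) {
    System.out.printf("at 180 Mbit/s for 70 min: %.1f GB%n", gigabytes(180, 70));
    System.out.printf("at 200 Mbit/s for 70 min: %.1f GB%n", gigabytes(200, 70));
  }
}
```

Under those assumptions the fetcher start-up would have pulled on the order of 95-105 GB from Cassandra before fetching a single URL.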

--Roland

On 22.02.2013 09:39, Lewis John Mcgibbney wrote:

Roland, i am curious to know exactly what is happening between the
fetcherJob initiation and actual fetch of 1st URL. Does the terminal just
hang? Can you track some metrics of the job?

On Thursday, February 21, 2013, Roland  wrote:

Hi Julien,

the point I personally don't get, is: why is generating fast - fetching not.
If it's possible to filter the generatorJob at the backend (what I think
it does), shouldn't it be possible to do the same for the fetcher?

--Roland

On 21.02.2013 12:27, Julien Nioche wrote:

Lewis,

The point is whether the filtering is done on the backend side (e.g. using
queries, indices, etc...) then passed on to MapReduce via GORA or as I
assume by looking at the code filtered within the MapReduce which means
that all the entries are pulled from the backend anyway.
This makes quite a difference in terms of performance if you think e.g
about a large webtable which would have to be entirely passed to mapreduce
even if only a handful of entries are to be processed.

Makes sense?

Julien


On 21 February 2013 01:52, Lewis John Mcgibbney wrote:

Those filters are applied only to URLs which do not have a null
GENERATE_MARK
e.g.

  if (Mark.GENERATE_MARK.checkMark(page) != null) {
    if (GeneratorJob.LOG.isDebugEnabled()) {
      GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
    }
    return;

Therefore filters will be applied to all URLs which have a null
GENERATE_MARK value.

On Wed, Feb 20, 2013 at 2:45 PM,  wrote:


Hi,

Are those filters put on all data selected from hbase or sent to hbase as
filters to select a subset of all hbase records?

Thanks.
Alex.







-Original Message-
From: Lewis John Mcgibbney 
To: user 
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage


Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM,  wrote:


> The generator also does not have filters. Its mapper goes over all
> records as far as I know. If you use hadoop you can see how many records go
> as input to mappers. Also see this

I don't think this is true. The GeneratorMapper filters URLs before
selecting them for inclusion based on the following
- distance
- URLNormalizer(s)
- URLFilter(s)
in that order.
I am going to start a new thread on improvements to the GeneratorJob
regarding better configuration as it is a crucial stage in the crawl
process.

So the issue here, as you correctly explain, is with the Fetcher obtaining
the URLs which have been marked with a desired batchId. This would be done
via scanner in Gora.




--
*Lewis*









Re: nutch with cassandra internal network usage

2013-02-22 Thread Julien Nioche
Hi Roland

My previous email should have started with "The point Alex is making is ..."
and not just "The point is ...".
I don't have an explanation as to why the generator is faster than the
fetching as I don't use 2.x at all but it would definitely be interesting
to find out. The behaviour of the fetcher is how I expect GORA to behave in
its current form i.e. pull everything - filter - process.

Julien


On 21 February 2013 16:58, Roland  wrote:

> Hi Julien,
>
> the point I personally don't get, is: why is generating fast - fetching
> not.
> If it's possible to filter the generatorJob at the backend (what I think
> it does), shouldn't it be possible to do the same for the fetcher?
>
> --Roland
>
> On 21.02.2013 12:27, Julien Nioche wrote:
>
>  Lewis,
>>
>> The point is whether the filtering is done on the backend side (e.g. using
>> queries, indices, etc...) then passed on to MapReduce via GORA or as I
>> assume by looking at the code filtered within the MapReduce which means
>> that all the entries are pulled from the backend anyway.
>> This makes quite a difference in terms of performance if you think e.g
>> about a large webtable which would have to be entirely passed to mapreduce
>> even if only a handful of entries are to be processed.
>>
>> Makes sense?
>>
>> Julien
>>
>>
>> On 21 February 2013 01:52, Lewis John Mcgibbney wrote:
>>
>>> Those filters are applied only to URLs which do not have a null
>>> GENERATE_MARK
>>> e.g.
>>>
>>>   if (Mark.GENERATE_MARK.checkMark(page) != null) {
>>>     if (GeneratorJob.LOG.isDebugEnabled()) {
>>>       GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>>>     }
>>>     return;
>>>
>>> Therefore filters will be applied to all URLs which have a null
>>> GENERATE_MARK value.
>>>
>>> On Wed, Feb 20, 2013 at 2:45 PM,  wrote:
>>>
>>>  Hi,
>>>>
>>>> Are those filters put on all data selected from hbase or sent to hbase
>>>> as
>>>> filters to select a subset of all hbase records?
>>>>
>>>> Thanks.
>>>> Alex.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -Original Message-
>>>> From: Lewis John Mcgibbney 
>>>> To: user 
>>>> Sent: Wed, Feb 20, 2013 12:56 pm
>>>> Subject: Re: nutch with cassandra internal network usage
>>>>
>>>>
>>>> Hi Alex,
>>>>
>>>> On Wed, Feb 20, 2013 at 11:54 AM,  wrote:
>>>>
>>>>> The generator also does not have filters. Its mapper goes over all
>>>>> records as far as I know. If you use hadoop you can see how many records go
>>>>> as input to mappers. Also see this
>>>>>
>>>> I don't think this is true. The GeneratorMapper filters URLs before
>>>> selecting them for inclusion based on the following
>>>> - distance
>>>> - URLNormalizer(s)
>>>> - URLFilter(s)
>>>> in that order.
>>>> I am going to start a new thread on improvements to the GeneratorJob
>>>> regarding better configuration as it is a crucial stage in the crawl
>>>> process.
>>>>
>>>> So the issue here, as you correctly explain, is with the Fetcher obtaining
>>>> the URLs which have been marked with a desired batchId. This would be done
>>>> via scanner in Gora.
>>>>
>>>>
>>>>
>>>>
>>> --
>>> *Lewis*
>>>
>>>
>>
>>
>


-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: nutch with cassandra internal network usage

2013-02-22 Thread Lewis John Mcgibbney
Roland, I am curious to know exactly what is happening between the
fetcherJob initiation and the actual fetch of the first URL. Does the terminal
just hang? Can you track some metrics of the job?

On Thursday, February 21, 2013, Roland  wrote:
> Hi Julien,
>
> the point I personally don't get, is: why is generating fast - fetching not.
> If it's possible to filter the generatorJob at the backend (what I think
> it does), shouldn't it be possible to do the same for the fetcher?
>
> --Roland
>
> On 21.02.2013 12:27, Julien Nioche wrote:
>>
>> Lewis,
>>
>> The point is whether the filtering is done on the backend side (e.g. using
>> queries, indices, etc...) then passed on to MapReduce via GORA or as I
>> assume by looking at the code filtered within the MapReduce which means
>> that all the entries are pulled from the backend anyway.
>> This makes quite a difference in terms of performance if you think e.g
>> about a large webtable which would have to be entirely passed to mapreduce
>> even if only a handful of entries are to be processed.
>>
>> Makes sense?
>>
>> Julien
>>
>>
>> On 21 February 2013 01:52, Lewis John Mcgibbney
>> wrote:
>>
>>> Those filters are applied only to URLs which do not have a null
>>> GENERATE_MARK
>>> e.g.
>>>
>>>  if (Mark.GENERATE_MARK.checkMark(page) != null) {
>>>    if (GeneratorJob.LOG.isDebugEnabled()) {
>>>      GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>>>    }
>>>    return;
>>>
>>> Therefore filters will be applied to all URLs which have a null
>>> GENERATE_MARK value.
>>>
>>> On Wed, Feb 20, 2013 at 2:45 PM,  wrote:
>>>
>>>> Hi,
>>>>
>>>> Are those filters put on all data selected from hbase or sent to hbase as
>>>> filters to select a subset of all hbase records?
>>>>
>>>> Thanks.
>>>> Alex.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -Original Message-
>>>> From: Lewis John Mcgibbney 
>>>> To: user 
>>>> Sent: Wed, Feb 20, 2013 12:56 pm
>>>> Subject: Re: nutch with cassandra internal network usage
>>>>
>>>>
>>>> Hi Alex,
>>>>
>>>> On Wed, Feb 20, 2013 at 11:54 AM,  wrote:
>>>>
>>>>> The generator also does not have filters. Its mapper goes over all
>>>>> records as far as I know. If you use hadoop you can see how many records go
>>>>> as input to mappers. Also see this
>>>>>
>>>> I don't think this is true. The GeneratorMapper filters URLs before
>>>> selecting them for inclusion based on the following
>>>> - distance
>>>> - URLNormalizer(s)
>>>> - URLFilter(s)
>>>> in that order.
>>>> I am going to start a new thread on improvements to the GeneratorJob
>>>> regarding better configuration as it is a crucial stage in the crawl
>>>> process.
>>>>
>>>> So the issue here, as you correctly explain, is with the Fetcher obtaining
>>>> the URLs which have been marked with a desired batchId. This would be done
>>>> via scanner in Gora.
>>>>
>>>>
>>>>
>>>
>>> --
>>> *Lewis*
>>>
>>
>>
>
>

-- 
*Lewis*


Re: nutch with cassandra internal network usage

2013-02-21 Thread Roland

Hi Julien,

the point I personally don't get is: why is generating fast, but fetching not?
If it's possible to filter the generatorJob at the backend (which I think
it does), shouldn't it be possible to do the same for the fetcher?


--Roland

On 21.02.2013 12:27, Julien Nioche wrote:

Lewis,

The point is whether the filtering is done on the backend side (e.g. using
queries, indices, etc...) then passed on to MapReduce via GORA or as I
assume by looking at the code filtered within the MapReduce which means
that all the entries are pulled from the backend anyway.
This makes quite a difference in terms of performance if you think e.g
about a large webtable which would have to be entirely passed to mapreduce
even if only a handful of entries are to be processed.

Makes sense?

Julien


On 21 February 2013 01:52, Lewis John Mcgibbney
wrote:


Those filters are applied only to URLs which do not have a null
GENERATE_MARK
e.g.

 if (Mark.GENERATE_MARK.checkMark(page) != null) {
   if (GeneratorJob.LOG.isDebugEnabled()) {
     GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
   }
   return;

Therefore filters will be applied to all URLs which have a null
GENERATE_MARK value.

On Wed, Feb 20, 2013 at 2:45 PM,  wrote:


Hi,

Are those filters put on all data selected from hbase or sent to hbase as
filters to select a subset of all hbase records?

Thanks.
Alex.







-Original Message-
From: Lewis John Mcgibbney 
To: user 
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage


Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM,  wrote:


> The generator also does not have filters. Its mapper goes over all
> records as far as I know. If you use hadoop you can see how many records go
> as input to mappers. Also see this

I don't think this is true. The GeneratorMapper filters URLs before
selecting them for inclusion based on the following
- distance
- URLNormalizer(s)
- URLFilter(s)
in that order.
I am going to start a new thread on improvements to the GeneratorJob
regarding better configuration as it is a crucial stage in the crawl
process.

So the issue here, as you correctly explain, is with the Fetcher obtaining
the URLs which have been marked with a desired batchId. This would be done
via scanner in Gora.





--
*Lewis*








Re: nutch with cassandra internal network usage

2013-02-21 Thread Lewis John Mcgibbney
I get it fine. I do think it is important to discuss the current filtering
code in the generator though. Yeah, okay, it turns out that our current
implementation (which reads all entries then does filtering on the Nutch side)
can be horribly expensive, but at least there is some mechanism in place,
right? We will work on the scan over in Gora after the 0.3 release.

Null unions in Avro schemas (e.g. GORA-174) have been kicking our heads in, but
we are getting there. As always, anyone interested in contributing to the
cause, please shoot over to user@gora
Thanks
Lewis

On Thursday, February 21, 2013, Julien Nioche 
wrote:
> Lewis,
>
> The point is whether the filtering is done on the backend side (e.g. using
> queries, indices, etc...) then passed on to MapReduce via GORA or as I
> assume by looking at the code filtered within the MapReduce which means
> that all the entries are pulled from the backend anyway.
> This makes quite a difference in terms of performance if you think e.g
> about a large webtable which would have to be entirely passed to mapreduce
> even if only a handful of entries are to be processed.
>
> Makes sense?
>
> Julien
>
>
> On 21 February 2013 01:52, Lewis John Mcgibbney
> wrote:
>
>> Those filters are applied only to URLs which do not have a null
>> GENERATE_MARK
>> e.g.
>>
>> if (Mark.GENERATE_MARK.checkMark(page) != null) {
>>   if (GeneratorJob.LOG.isDebugEnabled()) {
>>     GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>>   }
>>   return;
>>
>> Therefore filters will be applied to all URLs which have a null
>> GENERATE_MARK value.
>>
>> On Wed, Feb 20, 2013 at 2:45 PM,  wrote:
>>
>> > Hi,
>> >
>> > Are those filters put on all data selected from hbase or sent to hbase as
>> > filters to select a subset of all hbase records?
>> >
>> > Thanks.
>> > Alex.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > -Original Message-
>> > From: Lewis John Mcgibbney 
>> > To: user 
>> > Sent: Wed, Feb 20, 2013 12:56 pm
>> > Subject: Re: nutch with cassandra internal network usage
>> >
>> >
>> > Hi Alex,
>> >
>> > On Wed, Feb 20, 2013 at 11:54 AM,  wrote:
>> >
>> > >
>> > > The generator also does not have filters. Its mapper goes over all
>> > > records as far as I know. If you use hadoop you can see how many records go
>> > > as input to mappers. Also see this
>> > >
>> >
>> > I don't think this is true. The GeneratorMapper filters URLs before
>> > selecting them for inclusion based on the following
>> > - distance
>> > - URLNormalizer(s)
>> > - URLFilter(s)
>> > in that order.
>> > I am going to start a new thread on improvements to the GeneratorJob
>> > regarding better configuration as it is a crucial stage in the crawl
>> > process.
>> >
>> > So the issue here, as you correctly explain, is with the Fetcher obtaining
>> > the URLs which have been marked with a desired batchId. This would be done
>> > via scanner in Gora.
>> >
>> >
>> >
>>
>>
>> --
>> *Lewis*
>>
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>

-- 
*Lewis*


Re: nutch with cassandra internal network usage

2013-02-21 Thread Julien Nioche
Lewis,

The point is whether the filtering is done on the backend side (e.g. using
queries, indices, etc.) and the result then passed on to MapReduce via GORA or,
as I assume from looking at the code, done within the MapReduce job, which
means that all the entries are pulled from the backend anyway.
This makes quite a difference in terms of performance: think e.g. of a large
webtable which would have to be passed to MapReduce in its entirety even if
only a handful of entries are to be processed.
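
The difference between the two placements can be shown with a toy model. This is a minimal sketch, not GORA or Nutch code: the Row and Backend classes, the rowsTransferred counter and the batch id "b1" are all hypothetical and exist only to count how many rows leave the backend in each case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class FilterPlacementSketch {

  // Hypothetical web table row: a URL and the batch it belongs to (null if none).
  static final class Row {
    final String url;
    final String batchId;
    Row(String url, String batchId) { this.url = url; this.batchId = batchId; }
  }

  // Toy backend that counts how many rows it hands over to the client.
  static final class Backend {
    private final List<Row> rows;
    int rowsTransferred = 0;

    Backend(List<Row> rows) { this.rows = rows; }

    // Plain scan: every row crosses the network.
    List<Row> scanAll() {
      rowsTransferred += rows.size();
      return new ArrayList<>(rows);
    }

    // Scan with the predicate evaluated inside the backend.
    List<Row> scan(Predicate<Row> filter) {
      List<Row> matches = new ArrayList<>();
      for (Row row : rows) {
        if (filter.test(row)) matches.add(row);
      }
      rowsTransferred += matches.size();
      return matches;
    }
  }

  // Builds a table of `total` rows of which the first `inBatch` carry batchId.
  static Backend sampleBackend(int total, int inBatch, String batchId) {
    List<Row> rows = new ArrayList<>();
    for (int i = 0; i < total; i++) {
      rows.add(new Row("http://example.com/" + i, i < inBatch ? batchId : null));
    }
    return new Backend(rows);
  }

  // Pull everything, then filter on the client (the MapReduce side).
  static int clientSideFilter(Backend backend, String batchId) {
    int kept = 0;
    for (Row row : backend.scanAll()) {
      if (batchId.equals(row.batchId)) kept++;
    }
    return kept;
  }

  // Let the backend apply the filter and return only the matches.
  static int backendSideFilter(Backend backend, String batchId) {
    return backend.scan(row -> batchId.equals(row.batchId)).size();
  }

  public static void main(String[] args) {
    Backend first = sampleBackend(1000, 5, "b1");
    int keptClient = clientSideFilter(first, "b1");
    System.out.println("client-side:  kept " + keptClient + ", transferred " + first.rowsTransferred);

    Backend second = sampleBackend(1000, 5, "b1");
    int keptBackend = backendSideFilter(second, "b1");
    System.out.println("backend-side: kept " + keptBackend + ", transferred " + second.rowsTransferred);
  }
}
```

Both variants keep the same rows; only the number of rows that have to travel from the backend differs.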

Makes sense?

Julien


On 21 February 2013 01:52, Lewis John Mcgibbney
wrote:

> Those filters are applied only to URLs which do not already carry a
> GENERATE_MARK; URLs that have one are skipped first, e.g.
>
> if (Mark.GENERATE_MARK.checkMark(page) != null) {
>   if (GeneratorJob.LOG.isDebugEnabled()) {
>     GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
>   }
>   return;
> }
>
> Therefore filters will be applied to all URLs which have a null
> GENERATE_MARK value.
>
> On Wed, Feb 20, 2013 at 2:45 PM,  wrote:
>
> > Hi,
> >
> > Are those filters put on all data selected from hbase or sent to hbase as
> > filters to select a subset of all hbase records?
> >
> > Thanks.
> > Alex.
> >
> >
> >
> >
> >
> >
> >
> > -----Original Message-
> > From: Lewis John Mcgibbney 
> > To: user 
> > Sent: Wed, Feb 20, 2013 12:56 pm
> > Subject: Re: nutch with cassandra internal network usage
> >
> >
> > Hi Alex,
> >
> > On Wed, Feb 20, 2013 at 11:54 AM,  wrote:
> >
> > >
> > > The generator also does not have filters. Its mapper  goes over all
> > > records as far as I know. If you use hadoop you can see how many
> records
> > go
> > > as input to mappers. Also see this
> > >
> >
> > I don't think this is true. The GeneratorMapper filters URLs before
> > selecting them for inclusion based on the following
> > - distance
> > - URLNormalizer(s)
> > - URLFilter(s)
> > in that order.
> > I am going to start a new thread on improvements to the GeneratorJob
> > regarding better configuration as it is a crucial stage in the crawl
> > process.
> >
> > So the issue here, as you correctly explain, is with the Fetcher
> obtaining
> > the URLs which have been marked with a desired batchId. This would be
> done
> > via scanner in Gora.
> >
> >
> >
>
>
> --
> *Lewis*
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: nutch with cassandra internal network usage

2013-02-20 Thread Lewis John Mcgibbney
Those filters are applied only to URLs which do not already carry a
GENERATE_MARK; URLs that have one are skipped first, e.g.

if (Mark.GENERATE_MARK.checkMark(page) != null) {
  if (GeneratorJob.LOG.isDebugEnabled()) {
    GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
  }
  return;
}

Therefore filters will be applied to all URLs which have a null
GENERATE_MARK value.

On Wed, Feb 20, 2013 at 2:45 PM,  wrote:

> Hi,
>
> Are those filters put on all data selected from hbase or sent to hbase as
> filters to select a subset of all hbase records?
>
> Thanks.
> Alex.
>
>
>
>
>
>
>
> -Original Message-
> From: Lewis John Mcgibbney 
> To: user 
> Sent: Wed, Feb 20, 2013 12:56 pm
> Subject: Re: nutch with cassandra internal network usage
>
>
> Hi Alex,
>
> On Wed, Feb 20, 2013 at 11:54 AM,  wrote:
>
> >
> > The generator also does not have filters. Its mapper  goes over all
> > records as far as I know. If you use hadoop you can see how many records
> go
> > as input to mappers. Also see this
> >
>
> I don't think this is true. The GeneratorMapper filters URLs before
> selecting them for inclusion based on the following
> - distance
> - URLNormalizer(s)
> - URLFilter(s)
> in that order.
> I am going to start a new thread on improvements to the GeneratorJob
> regarding better configuration as it is a crucial stage in the crawl
> process.
>
> So the issue here, as you correctly explain, is with the Fetcher obtaining
> the URLs which have been marked with a desired batchId. This would be done
> via scanner in Gora.
>
>
>


-- 
*Lewis*


Re: nutch with cassandra internal network usage

2013-02-20 Thread alxsss
Hi,

Are those filters put on all data selected from hbase or sent to hbase as 
filters to select a subset of all hbase records?

Thanks.
Alex.

 

 

 

-Original Message-
From: Lewis John Mcgibbney 
To: user 
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage


Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM,  wrote:

>
> The generator also does not have filters. Its mapper  goes over all
> records as far as I know. If you use hadoop you can see how many records go
> as input to mappers. Also see this
>

I don't think this is true. The GeneratorMapper filters URLs before
selecting them for inclusion based on the following
- distance
- URLNormalizer(s)
- URLFilter(s)
in that order.
I am going to start a new thread on improvements to the GeneratorJob
regarding better configuration as it is a crucial stage in the crawl
process.

So the issue here, as you correctly explain, is with the Fetcher obtaining
the URLs which have been marked with a desired batchId. This would be done
via scanner in Gora.

 


Re: nutch with cassandra internal network usage

2013-02-20 Thread Lewis John Mcgibbney
Hi,

Please head over to the most recent thread on dev@ for potential improvements
to the Generator* code.

Thanks for invoking this discussion, it is well overdue.

Lewis



On Wed, Feb 20, 2013 at 12:55 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Alex,
>
>
> On Wed, Feb 20, 2013 at 11:54 AM,  wrote:
>
>>
>> The generator also does not have filters. Its mapper  goes over all
>> records as far as I know. If you use hadoop you can see how many records go
>> as input to mappers. Also see this
>>
>
> I don't think this is true. The GeneratorMapper filters URLs before
> selecting them for inclusion based on the following
> - distance
> - URLNormalizer(s)
> - URLFilter(s)
> in that order.
> I am going to start a new thread on improvements to the GeneratorJob
> regarding better configuration as it is a crucial stage in the crawl
> process.
>
> So the issue here, as you correctly explain, is with the Fetcher obtaining
> the URLs which have been marked with a desired batchId. This would be done
> via scanner in Gora.
>



-- 
*Lewis*


Re: nutch with cassandra internal network usage

2013-02-20 Thread Lewis John Mcgibbney
Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM,  wrote:

>
> The generator also does not have filters. Its mapper  goes over all
> records as far as I know. If you use hadoop you can see how many records go
> as input to mappers. Also see this
>

I don't think this is true. The GeneratorMapper filters URLs before
selecting them for inclusion based on the following
- distance
- URLNormalizer(s)
- URLFilter(s)
in that order.
I am going to start a new thread on improvements to the GeneratorJob
regarding better configuration as it is a crucial stage in the crawl
process.
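
The order of checks listed above (distance, then normalizers, then filters)
can be sketched roughly as follows. This is a simplified illustration and not
the actual GeneratorMapper code: the method signature, the stand-in functional
types and the convention that a negative maxDistance means "no limit" are
assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Illustrative only: these types are stand-ins, not Nutch classes.
final class GeneratorOrderSketch {

    /** Returns the normalized URL if it passes all checks, or null if it is skipped. */
    static String select(String url, int distance, int maxDistance,
                         List<UnaryOperator<String>> normalizers,
                         List<Predicate<String>> filters) {
        // 1. distance (assumption: a negative maxDistance disables the check)
        if (maxDistance >= 0 && distance > maxDistance) {
            return null;
        }
        // 2. normalizers, applied in sequence
        String current = url;
        for (UnaryOperator<String> normalizer : normalizers) {
            current = normalizer.apply(current);
        }
        // 3. filters, applied to the normalized URL; any rejection skips the URL
        for (Predicate<String> filter : filters) {
            if (!filter.test(current)) {
                return null;
            }
        }
        return current;
    }

    /** Hypothetical normalizer: lower-cases the whole URL. */
    static List<UnaryOperator<String>> sampleNormalizers() {
        UnaryOperator<String> lowerCase = s -> s.toLowerCase(Locale.ROOT);
        return Arrays.asList(lowerCase);
    }

    /** Hypothetical filter: accepts only http:// URLs. */
    static List<Predicate<String>> sampleFilters() {
        Predicate<String> httpOnly = u -> u.startsWith("http://");
        return Arrays.asList(httpOnly);
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> normalizers = sampleNormalizers();
        List<Predicate<String>> filters = sampleFilters();
        System.out.println(select("HTTP://Example.com/A", 1, 3, normalizers, filters));
        System.out.println(select("http://example.com/far", 5, 3, normalizers, filters));
        System.out.println(select("ftp://example.com/", 0, 3, normalizers, filters));
    }
}
```

Note that all of these checks run inside the mapper, i.e. after the record
has already been read from the store.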

So the issue here, as you correctly explain, is with the Fetcher obtaining
the URLs which have been marked with a desired batchId. This would be done
via scanner in Gora.


Re: nutch with cassandra internal network usage

2013-02-20 Thread alxsss

The generator also does not have filters. Its mapper goes over all records as
far as I know. If you use hadoop you can see how many records go as input to
mappers. Also see this:

https://issues.apache.org/jira/browse/GORA-119

Alex.

 

 

 

-Original Message-
From: Roland 
To: user 
Sent: Wed, Feb 20, 2013 11:47 am
Subject: Re: nutch with cassandra internal network usage


Hi Alex,

the GeneratorJob seems to have a solution for that, if not it would 
iterate over all records too, am I right?

--Roland

Am 20.02.2013 20:42, schrieb alx...@aim.com:
> Hi,
>
> This is because the fetcher's mapper goes over all records and selects those
> that have the given batchId. Currently the mappers of all nutch commands do
> not have filters.
> It would be interesting to know if you can select records with a given
> batchId in cassandra without iterating over all records.
>
>
> Alex.
>

 
 


Re: nutch with cassandra internal network usage

2013-02-20 Thread Roland

Hi Alex,

the GeneratorJob seems to have a solution for that, if not it would 
iterate over all records too, am I right?


--Roland

Am 20.02.2013 20:42, schrieb alx...@aim.com:

Hi,

This is because the fetcher's mapper goes over all records and selects those
that have the given batchId. Currently the mappers of all nutch commands do
not have filters.
It would be interesting to know if you can select records with a given
batchId in cassandra without iterating over all records.


Alex.



Re: nutch with cassandra internal network usage

2013-02-20 Thread alxsss
Hi,

This is because the fetcher's mapper goes over all records and selects those
that have the given batchId. Currently the mappers of all nutch commands do
not have filters.
It would be interesting to know if you can select records with a given
batchId in cassandra without iterating over all records.


Alex.

 

 

 

-Original Message-
From: Roland 
To: user 
Sent: Wed, Feb 20, 2013 10:56 am
Subject: Re: nutch with cassandra internal network usage


Hi Lewis,

the GeneratorJob takes only ~5 minutes.
I'm running it in standalone mode, like this:
./bin/nutch fetch 1361367698-1708119958 -threads 40

It's configured to fetch & parse, but it makes no difference if it only 
fetches:
FetcherJob: starting
FetcherJob: batchId: 1361367698-1708119958
FetcherJob: threads: 40
FetcherJob: parsing: true
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1

--Roland


Am 20.02.2013 19:44, schrieb Lewis John Mcgibbney:
> Hi Roland,
>
> You say you start a fetch run; does this mean the FetcherJob or
> GeneratorJob? What kind of settings do you run your Nutch server with?

 


Re: nutch with cassandra internal network usage

2013-02-20 Thread Lewis John Mcgibbney
I am assuming that your generate.max.count property value is set to the
default -1? Have you tried configuring more, smaller batchIds (fetch
lists)?
I don't have an immediate answer as to why, overall, the FetcherJob is
taking this amount of time and resources.

On Wednesday, February 20, 2013, Roland  wrote:
> Hi Lewis,
>
> the GeneratorJob takes only ~5 minutes.
> I'm running it in standalone mode, like this:
> ./bin/nutch fetch 1361367698-1708119958 -threads 40
>
> It's configured to fetch & parse, but it makes no difference if it only
> fetches:
> FetcherJob: starting
> FetcherJob: batchId: 1361367698-1708119958
> FetcherJob: threads: 40
> FetcherJob: parsing: true
> FetcherJob: resuming: false
> FetcherJob : timelimit set for : -1
>
> --Roland
>
>
> Am 20.02.2013 19:44, schrieb Lewis John Mcgibbney:
>>
>> Hi Roland,
>>
>> You say you start a fetch run; does this mean the FetcherJob or
>> GeneratorJob? What kind of settings do you run your Nutch server with?
>

-- 
*Lewis*


Re: nutch with cassandra internal network usage

2013-02-20 Thread Roland

Hi Lewis,

the GeneratorJob takes only ~5 minutes.
I'm running it in standalone mode, like this:
./bin/nutch fetch 1361367698-1708119958 -threads 40

It's configured to fetch & parse, but it makes no difference if it only 
fetches:

FetcherJob: starting
FetcherJob: batchId: 1361367698-1708119958
FetcherJob: threads: 40
FetcherJob: parsing: true
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1

--Roland


Am 20.02.2013 19:44, schrieb Lewis John Mcgibbney:

Hi Roland,

You say you start a fetch run; does this mean the FetcherJob or
GeneratorJob? What kind of settings do you run your Nutch server with?


Re: nutch with cassandra internal network usage

2013-02-20 Thread Lewis John Mcgibbney
Hi Roland,

You say you start a fetch run; does this mean the FetcherJob or
GeneratorJob? What kind of settings do you run your Nutch server with?

On Wednesday, February 20, 2013, Roland  wrote:
> Hi list,
>
> we're experimenting with nutch 2.1 and cassandra 1.2.1 (on different hosts).
> Our cassandra 'webpage' store has about 31GB right now on disk, we add
> URLs by 'injecting' them, about 100k-300k per cycle.
> When starting a 'fetch' run, it now needs about an hour before the queues
> are set up / the first page is fetched.
> During this time we can see about 180MBit/s network traffic from the
> cassandra host to the nutch host (outgoing of cassandra).
> If I calculate the transferred data during this time (taking only
> 150Mbit/s into account):
> 150MBit/s*1000*1000/8/1024/1024/1024*3600sec ~= 62GB
>
> So, why does nutch load all data from the db, and not only the relevant
> data of this fetch? And why does it happen twice?
>
> Thanks,
> Roland
>

-- 
*Lewis*


nutch with cassandra internal network usage

2013-02-20 Thread Roland

Hi list,

we're experimenting with nutch 2.1 and cassandra 1.2.1 (on different hosts).
Our cassandra 'webpage' store has about 31GB right now on disk, we add 
URLs by 'injecting' them, about 100k-300k per cycle.
When starting a 'fetch' run, it now needs about an hour before the 
queues are set up / the first page is fetched.
During this time we can see about 180MBit/s network traffic from the 
cassandra host to the nutch host (outgoing of cassandra).
If I calculate the transferred data during this time (taking only 
150Mbit/s into account):

150MBit/s*1000*1000/8/1024/1024/1024*3600sec ~= 62GB
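
As a sanity check on the figure above, here is the same calculation as a
small Java sketch. The class and method names are made up for illustration;
the inputs (150 Mbit/s, one hour, a 31GB store) are the numbers from this
mail, and "GB" is taken to mean 1024^3 bytes as in the formula.

```java
// Back-of-the-envelope check of the transfer volume quoted above.
final class TransferEstimate {

    /** Converts a sustained rate in Mbit/s (decimal mega) over a duration into GiB (2^30 bytes). */
    static double gibTransferred(double megabitsPerSecond, long seconds) {
        double bytesPerSecond = megabitsPerSecond * 1_000_000.0 / 8.0;
        double totalBytes = bytesPerSecond * seconds;
        return totalBytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        double gib = gibTransferred(150.0, 3600L);
        // Ratio against the 31GB on-disk size of the 'webpage' store.
        double ratio = gib / 31.0;
        System.out.println("transferred (GiB): " + gib);
        System.out.println("multiple of the 31GB store: " + ratio);
    }
}
```

The result comes to just under 63, i.e. roughly twice the 31GB the store
occupies on disk, which is where the "twice" in the question below comes
from.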

So, why does nutch load all data from the db, and not only the relevant 
data of this fetch? And why does it happen twice?


Thanks,
Roland