Re: Nutch generate fetch lists for a single domain (but with multiple urls) crawl

2012-10-18 Thread shri_s_ram
I get it now! Thanks a lot!
I was running my crawl command with fetcher.parse set to true, which was
causing the problem.
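For anyone hitting the same thing: turning that off in nutch-site.xml might look roughly like the sketch below (illustrative only; check nutch-default.xml for the authoritative description of fetcher.parse):

<property>
  <name>fetcher.parse</name>
  <!-- illustrative: run parsing as a separate step instead of inside the fetcher -->
  <value>false</value>
</property>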

On Thu, Oct 18, 2012 at 5:53 PM, Markus Jelsma-2 [via Lucene] <
ml-node+s472066n4014609...@n3.nabble.com> wrote:

> You would have to check the generator code to make sure. But why would you
> want to distribute the queue for a single domain to multiple mappers? A
> single locally running mapper without parsing on a low-end machine can easily
> fetch 20-40 records per second from the same domain (if it allows you to do
> it). At that speed you can easily fetch a few million records in a day
> or so.
>
> -Original message-
>
> > From: shri_s_ram
> >
> > Sent: Thu 18-Oct-2012 23:11
> > To: user@nutch.apache.org
> > Subject: RE: Nutch generate fetch lists for a single domain (but with
> > multiple urls) crawl
> >
> > Thanks.. But I thought there would be a way around it..
> > Is it possible even to have multiple fetch lists generated (for this
> > problem) at all by tweaking some parameters?
> >
> > [I am thinking of something like partition.url.mode - byRandom]
> >
> >
> >
>
>





RE: Nutch 2.x : ParseUtil failing for some pdf files

2012-10-18 Thread j.sullivan
Kiran,

I took a look at your nutch-site.xml and I did not see anything for
http.accept. I believe nutch-default.xml does not include application/pdf by
default in http.accept, so you may need to add it in your nutch-site.xml.
Please take a look at the example below from my nutch-site.xml:



<property>
  <name>http.accept</name>
  <value>text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8</value>
  <description>Value of the "Accept" request header field.</description>
</property>

Good Luck

James

-Original Message-
From: kiran chitturi [mailto:chitturikira...@gmail.com] 
Sent: Friday, October 19, 2012 6:41 AM
To: user@nutch.apache.org
Subject: Re: Nutch 2.x : ParseUtil failing for some pdf files

Hi James,

I have increased the limit in nutch-site.xml (
https://github.com/salvager/nutch/blob/master/nutch-site.xml) and I have
created the webpage table based on the fields here (
http://nlp.solutions.asia/?p=180).
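For reference, raising those limits in nutch-site.xml might look roughly like the sketch below (the property names are the ones mentioned further down in this thread; the values here are only placeholders, not the ones I actually used):

<property>
  <name>http.content.limit</name>
  <!-- placeholder value in bytes; a negative value would disable truncation -->
  <value>67108864</value>
</property>
<property>
  <name>file.content.limit</name>
  <!-- placeholder value in bytes -->
  <value>67108864</value>
</property>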

The database still shows the parseStatus as
'-org.apache.nutch.parse.ParseException: Unable to successfully parse
content'. The text field is 'null' for them. This is the screenshot of the
MySQL database that I have.

Can you please tell me how I can overcome this problem? This is the
screenshot of my webpage table.

Many Thanks for your help.

Regards,
Kiran.

On Wed, Oct 17, 2012 at 6:20 AM,  wrote:

> Hi Kiran,
>
> I agree with Julien it is probably trimmed content.
>
> I regularly parse PDFs with Nutch 2.x with MySQL as the backend 
> without problem (even without the patch).
>
> The differences in my set up from the standard set up that may be
> applicable:
>
> 1) In nutch-site.xml the file.content.limit and http.content.limit are 
> set to 600.
> 2) I have a custom create webpage table sql script that creates fields 
> that can hold more.  The default table fields are not sufficiently 
> large in most real world situations. http://nlp.solutions.asia/?p=180
>
> I crawled http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/ and it 
> successfully parsed all except one of the PDFs, v29n3.pdf. That PDF is 
> almost 20 megs, much larger than the limit in nutch-default.xml and 
> even larger than that configured in my nutch-site.xml. Interestingly 
> that PDF is also completely pictures (what looks like text is actually 
> pictures of
> text) so there may be no real text to parse.
>
> James
>
> 
> From: Julien Nioche [lists.digitalpeb...@gmail.com]
> Sent: Wednesday, October 17, 2012 4:17 PM
> To: user@nutch.apache.org
> Subject: Re: Nutch 2.x : ParseUtil failing for some pdf files
>
> trimmed content?
>
> On 16 October 2012 22:47, kiran chitturi 
> wrote:
>
> > Hi,
> >
> > I am running Nutch 2.x with patch here at
> > https://issues.apache.org/jira/browse/NUTCH-1433 and connected to a
> mysql
> > database.
> >
> > After the {inject, generate, fetch} commands, when I issue the
> > command (sh bin/nutch parse 1350396627-126726428) the parserJob was a
> > success, but when I look inside the database only one pdf file is
> > parsed out of 10.
> >
> > When I look into hadoop.log it shows statements like '2012-10-16
> > 16:04:30,682 WARN  parse.ParseUtil - Unable to successfully parse
> > content http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf
> > of type application/pdf'.
> >
> > The logs of the successfully parsed and failed ones are below. The logs
> > below show that the pdf file '.../agosto.pdf' is parsed and the file
> > '/authors.pdf' is not parsed.
> >
> > The same thing happened for all the other pdf files; the parse failed.
> > When I run 'sh bin/nutch parsechecker {url}' it works with the failed
> > pdf files and does not show any errors.
> >
> >
> > 2012-10-16 16:04:28,150 INFO  parse.ParserJob - Parsing
> > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/agosto.pdf
> > > 2012-10-16 16:04:28,151 INFO  parse.ParserFactory - The parsing
> plugins:
> > > [org.apache.nutch.parse.tika.TikaParser] are enabled via the 
> > > plugin.includes system property, and all claim to support the 
> > > content
> > type
> > > application/pdf, but they are not mapped to it in the
> > > parse-plugins.xml file
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > content-type  application/pdf
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > dcterms:modified  2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > meta:creation-date2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > meta:save-date2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > last-modified 2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  pa

RE: Nutch generate fetch lists for a single domain (but with multiple urls) crawl

2012-10-18 Thread Markus Jelsma
You would have to check the generator code to make sure. But why would you want
to distribute the queue for a single domain to multiple mappers? A single locally
running mapper without parsing on a low-end machine can easily fetch 20-40
records per second from the same domain (if it allows you to do it). At that
speed you can easily fetch a few million records in a day or so.

-Original message-
> From:shri_s_ram 
> Sent: Thu 18-Oct-2012 23:11
> To: user@nutch.apache.org
> Subject: RE: Nutch generate fetch lists for a single domain (but with 
> multiple urls) crawl
> 
> Thanks.. But I thought there would be a way around it..
> Is it possible even to have multiple fetch lists generated (for this
> problem) at all by tweaking some parameters?
> 
> [I am thinking of something like partition.url.mode - byRandom]  
> 
> 
> 
> 


Re: Nutch 2.x : ParseUtil failing for some pdf files

2012-10-18 Thread kiran chitturi
Hi James,

I have increased the limit in nutch-site.xml (
https://github.com/salvager/nutch/blob/master/nutch-site.xml) and I have
created the webpage table based on the fields here (
http://nlp.solutions.asia/?p=180).

The database still shows the parseStatus as
'-org.apache.nutch.parse.ParseException: Unable to successfully parse
content'. The text field is 'null' for them. This is the screenshot of the
MySQL database that I have.

Can you please tell me how I can overcome this problem? This is the
screenshot of my webpage table.

Many Thanks for your help.

Regards,
Kiran.

On Wed, Oct 17, 2012 at 6:20 AM,  wrote:

> Hi Kiran,
>
> I agree with Julien it is probably trimmed content.
>
> I regularly parse PDFs with Nutch 2.x with MySQL as the backend without
> problem (even without the patch).
>
> The differences in my set up from the standard set up that may be
> applicable:
>
> 1) In nutch-site.xml the file.content.limit and http.content.limit are set
> to 600.
> 2) I have a custom create webpage table sql script that creates fields
> that can hold more.  The default table fields are not sufficiently large in
> most real world situations. http://nlp.solutions.asia/?p=180
>
> I crawled http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/ and it
> successfully parsed all except one of the PDFs, v29n3.pdf. That PDF is
> almost 20 megs, much larger than the limit in nutch-default.xml and even
> larger than that configured in my nutch-site.xml. Interestingly that PDF is
> also completely pictures (what looks like text is actually pictures of
> text) so there may be no real text to parse.
>
> James
>
> 
> From: Julien Nioche [lists.digitalpeb...@gmail.com]
> Sent: Wednesday, October 17, 2012 4:17 PM
> To: user@nutch.apache.org
> Subject: Re: Nutch 2.x : ParseUtil failing for some pdf files
>
> trimmed content?
>
> On 16 October 2012 22:47, kiran chitturi 
> wrote:
>
> > Hi,
> >
> > I am running Nutch 2.x with patch here at
> > https://issues.apache.org/jira/browse/NUTCH-1433 and connected to a
> mysql
> > database.
> >
> > After the {inject, generate, fetch} commands, when I issue the command (sh
> > bin/nutch parse 1350396627-126726428) the parserJob was a success, but when
> > I look inside the database only one pdf file is parsed out of 10.
> >
> > When I look into hadoop.log it shows statements like '2012-10-16
> > 16:04:30,682 WARN  parse.ParseUtil - Unable to successfully parse content
> > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/authors.pdf of type
> > application/pdf'.
> >
> > The logs of the successfully parsed and failed ones are below. The logs below
> > show that the pdf file '.../agosto.pdf' is parsed and the file
> > '/authors.pdf' is not parsed.
> >
> > The same thing happened for all the other pdf files; the parse failed. When I
> > run 'sh bin/nutch parsechecker {url}' it works with the failed pdf files
> > and does not show any errors.
> >
> >
> > 2012-10-16 16:04:28,150 INFO  parse.ParserJob - Parsing
> > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/agosto.pdf
> > > 2012-10-16 16:04:28,151 INFO  parse.ParserFactory - The parsing
> plugins:
> > > [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> > > plugin.includes system property, and all claim to support the content
> > type
> > > application/pdf, but they are not mapped to it in the
> > > parse-plugins.xml file
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > content-type  application/pdf
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > dcterms:modified  2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > meta:creation-date2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > meta:save-date2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > last-modified 2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > dc:creatorDenise E. Agosto
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > dcterms:created   2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > creation-date 2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > date
> > >  2010-10-20T21:12:47Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > xmp:creatortool   ScanWizard 5
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsParser - Found meta tag :
> > > modified  2010-11-02T20:51:27Z
> > > 2012-10-16 16:04:30,549 WARN  parse.MetaTagsPars

RE: Nutch generate fetch lists for a single domain (but with multiple urls) crawl

2012-10-18 Thread shri_s_ram
Thanks.. But I thought there would be a way around it..
Is it possible even to have multiple fetch lists generated (for this
problem) at all by tweaking some parameters?

[I am thinking of something like partition.url.mode - byRandom]  





RE: Nutch generate fetch lists for a single domain (but with multiple urls) crawl

2012-10-18 Thread Markus Jelsma
Hi - the generator tool partitions URLs by host, domain or IP address, so they'll
all end up in the same fetch list. Since you're doing only one domain there is
no need to run additional mappers. If you want to crawl them as fast as you can
(and you are allowed to do that) then use only one mapper and increase the
number of threads and the number of threads per queue.

Keep in mind that it is considered impolite to crawl a host with too many
threads and too little delay between successive fetches. You can do it if you
own the host or have an agreement to do it. Reuters.com won't appreciate many
URLs fetched with 10 threads without delay.
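As a rough illustration only, the knobs referred to above live in nutch-site.xml (likely fetcher.threads.fetch and fetcher.threads.per.queue; check nutch-default.xml for the exact names in your version). The values below are just placeholders, not a recommendation:

<property>
  <name>fetcher.threads.fetch</name>
  <!-- illustrative: total fetcher threads for the single mapper -->
  <value>50</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <!-- illustrative: parallel fetches against the same host queue; only raise this
       above 1 if you own the host or have permission -->
  <value>5</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <!-- illustrative: seconds between successive fetches from the same queue -->
  <value>1.0</value>
</property>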
 
-Original message-
> From:shri_s_ram 
> Sent: Thu 18-Oct-2012 22:40
> To: user@nutch.apache.org
> Subject: Nutch generate fetch lists for a single domain (but with multiple 
> urls) crawl
> 
> Hi, I am using Apache Nutch to crawl a website (say reuters.com). My seed urls
> are like the following:
> 1. http://www.reuters.com/news/archive?view=page&page=1&pageSize=10
> 2. http://www.reuters.com/news/archive?view=page&page=2&pageSize=10
> Now when I use the crawl command with the mapred.map.tasks parameter set to 100
> and partition.url.mode set to byHost, Nutch generates 100 fetch lists but only
> one of them has all the urls. This in turn means that out of 100 fetch jobs,
> one of them takes a long time (actually all the time). I need to fetch urls
> from the same domain (but different urls) in multiple fetch jobs. Can someone
> help me out with the parameter settings for the same? Is this possible?
> Cheers,
> Shriram Sridharan
> 
> 
> 


Re: Same pages crawled more than once and slow crawling

2012-10-18 Thread Julien Nioche
Off topic: we are talking about an issue with the SQL backend in GORA, not
the performance of Nutch.


Julien

On 18 October 2012 20:28, Stefan Scheffler wrote:

> Hi,
> The reason why nutch is so slow is that all of the steps use hadoop
> jobs, which take a long time to start. There is also a hardcoded
> 3-second delay somewhere in the hadoop core, which makes sense in
> distributed systems but not on single machines.
>
> Regards
> stefan
>
> On 18.10.2012 17:55, Luca Vasarelli wrote:
>
>> Hello,
>>
>> I'm using Nutch 2.1 on Linux and I'm having a similar problem to
>> http://goo.gl/nrDLV, my Nutch is fetching the same pages at each round.
>>
>> I've built a simple localhost site, with 3 pages linked to each other:
>> first.htm -> second.htm -> third.htm
>>
>> I did these steps:
>>
>> - downloaded nutch 2.1 (source) & untarred to ${TEMP_NUTCH}
>> - edited ${TEMP_NUTCH}/ivy/ivy.xml uncommenting the line about mysql
>> backend (thanks to [1])
>> - edited ${TEMP_NUTCH}/conf/gora.properties removing default sql
>> configuration and adding mysql properties (thanks to [1])
>> - ran "ant runtime" from ${TEMP_NUTCH}
>> - moved ${TEMP_NUTCH}/runtime/local/ to /opt/${NUTCH_HOME}
>> - edited ${NUTCH_HOME}/conf/nutch-site.xml adding http.agent.name,
>> http.robots.agents and changing db.ignore.external.links to true and
>> fetcher.server.delay to 0.0
>> - created ${NUTCH_HOME}/urls/seed.txt with "http://localhost/test/first.htm"
>> inside this file
>> - created db table as [1]
>>
>> Then I ran "bin/nutch crawl urls -threads 1"
>>
>> first.htm was fetched 5 times
>> second.htm was fetched 4 times
>> third.htm was fetched 3 times
>>
>> I tried doing each step separately (inject, generate, ...) with the same
>> results.
>>
>> Also the whole process takes about 2 minutes, am I missing something about
>> some delay config or is this normal?
>>
>> Some extra info:
>>
>> - HTML of the pages: http://pastebin.com/dyDPJeZs
>> - Hadoop log: http://pastebin.com/rwQQPnkE
>> - nutch-site.xml: http://pastebin.com/0WArkvh5
>> - Wireshark log: http://pastebin.com/g4Bg17Ls
>> - MySQL table: http://pastebin.com/gD2SvGsy
>>
>> [1] http://nlp.solutions.asia/?p=180
>>
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: Same pages crawled more than once and slow crawling

2012-10-18 Thread Stefan Scheffler

Hi,
The reason why nutch is so slow is that all of the steps use hadoop
jobs, which take a long time to start. There is also a hardcoded
3-second delay somewhere in the hadoop core, which makes sense in
distributed systems but not on single machines.


Regards
stefan

On 18.10.2012 17:55, Luca Vasarelli wrote:

Hello,

I'm using Nutch 2.1 on Linux and I'm having a similar problem to
http://goo.gl/nrDLV, my Nutch is fetching the same pages at each round.


I've built a simple localhost site, with 3 pages linked to each other:
first.htm -> second.htm -> third.htm

I did these steps:

- downloaded nutch 2.1 (source) & untarred to ${TEMP_NUTCH}
- edited ${TEMP_NUTCH}/ivy/ivy.xml uncommenting the line about mysql 
backend (thanks to [1])
- edited ${TEMP_NUTCH}/conf/gora.properties removing default sql 
configuration and adding mysql properties (thanks to [1])

- ran "ant runtime" from ${TEMP_NUTCH}
- moved ${TEMP_NUTCH}/runtime/local/ to /opt/${NUTCH_HOME}
- edited ${NUTCH_HOME}/conf/nutch-site.xml adding http.agent.name, 
http.robots.agents and changing db.ignore.external.links to true and 
fetcher.server.delay to 0.0
- created ${NUTCH_HOME}/urls/seed.txt with
"http://localhost/test/first.htm" inside this file

- created db table as [1]

Then I ran "bin/nutch crawl urls -threads 1"

first.htm was fetched 5 times
second.htm was fetched 4 times
third.htm was fetched 3 times

I tried doing each step separately (inject, generate, ...) with the 
same results.


Also the whole process takes about 2 minutes, am I missing something 
about some delay config or is this normal?


Some extra info:

- HTML of the pages: http://pastebin.com/dyDPJeZs
- Hadoop log: http://pastebin.com/rwQQPnkE
- nutch-site.xml: http://pastebin.com/0WArkvh5
- Wireshark log: http://pastebin.com/g4Bg17Ls
- MySQL table: http://pastebin.com/gD2SvGsy

[1] http://nlp.solutions.asia/?p=180




Re: Same pages crawled more than once and slow crawling

2012-10-18 Thread alxsss
Hello,

I think the problem is with the storage, not nutch itself. It looks like generate
cannot read the status or fetch time (or gets null values) from mysql.
I had a bunch of issues with mysql storage and switched to hbase in the end.

Alex.

 

 

 

-Original Message-
From: Sebastian Nagel 
To: user 
Sent: Thu, Oct 18, 2012 12:08 pm
Subject: Re: Same pages crawled more than once and slow crawling


Hi Luca,

> I'm using Nutch 2.1 on Linux and I'm having a similar problem to
> http://goo.gl/nrDLV, my Nutch is fetching the same pages at each round.
Um... I failed to reproduce Pierre's problem with
- a simpler configuration
- HBase as back-end (Pierre and Luca both use mysql)

> Then I ran "bin/nutch crawl urls -threads 1"
>
> first.htm was fetched 5 times
> second.htm was fetched 4 times
> third.htm was fetched 3 times
But after the 5th cycle the crawler stopped?

> I tried doing each step separately (inject, generate, ...) with the same
> results.
For Pierre this has worked...
Any suggestions?

> Also the whole process takes about 2 minutes, am I missing something about
> some delay config or is this normal?
Well, Nutch (resp. Hadoop) is designed to process much data. Job management
has some overhead (and some artificial sleeps): 5 cycles * 4 jobs
(generate/fetch/parse/update) = 20 jobs. 6s per job seems roughly ok, though
it could be slightly faster.

Sebastian

On 10/18/2012 05:55 PM, Luca Vasarelli wrote:
> Hello,
> 
> I'm using Nutch 2.1 on Linux and I'm having a similar problem to
> http://goo.gl/nrDLV, my Nutch is fetching the same pages at each round.
> 
> I've built a simple localhost site, with 3 pages linked to each other:
> first.htm -> second.htm -> third.htm
> 
> I did these steps:
> 
> - downloaded nutch 2.1 (source) & untarred to ${TEMP_NUTCH}
> - edited ${TEMP_NUTCH}/ivy/ivy.xml uncommenting the line about mysql backend
> (thanks to [1])
> - edited ${TEMP_NUTCH}/conf/gora.properties removing default sql configuration
> and adding mysql properties (thanks to [1])
> - ran "ant runtime" from ${TEMP_NUTCH}
> - moved ${TEMP_NUTCH}/runtime/local/ to /opt/${NUTCH_HOME}
> - edited ${NUTCH_HOME}/conf/nutch-site.xml adding http.agent.name,
> http.robots.agents and changing db.ignore.external.links to true and
> fetcher.server.delay to 0.0
> - created ${NUTCH_HOME}/urls/seed.txt with "http://localhost/test/first.htm"
> inside this file
> - created db table as [1]
> 
> Then I ran "bin/nutch crawl urls -threads 1"
> 
> first.htm was fetched 5 times
> second.htm was fetched 4 times
> third.htm was fetched 3 times
> 
> I tried doing each step separately (inject, generate, ...) with the same
> results.
> 
> Also the whole process takes about 2 minutes, am I missing something about
> some delay config or is this normal?
> 
> Some extra info:
> 
> - HTML of the pages: http://pastebin.com/dyDPJeZs
> - Hadoop log: http://pastebin.com/rwQQPnkE
> - nutch-site.xml: http://pastebin.com/0WArkvh5
> - Wireshark log: http://pastebin.com/g4Bg17Ls
> - MySQL table: http://pastebin.com/gD2SvGsy
> 
> [1] http://nlp.solutions.asia/?p=180


 


Re: Same pages crawled more than once and slow crawling

2012-10-18 Thread Sebastian Nagel
Hi Luca,

> I'm using Nutch 2.1 on Linux and I'm having a similar problem to
> http://goo.gl/nrDLV, my Nutch is
> fetching the same pages at each round.
Um... I failed to reproduce Pierre's problem with
- a simpler configuration
- HBase as back-end (Pierre and Luca both use mysql)

> Then I ran "bin/nutch crawl urls -threads 1"
>
> first.htm was fetched 5 times
> second.htm was fetched 4 times
> third.htm was fetched 3 times
But after the 5th cycle the crawler stopped?

> I tried doing each step separately (inject, generate, ...) with the same 
> results.
For Pierre this has worked...
Any suggestions?

> Also the whole process takes about 2 minutes, am I missing something about 
> some delay config or is
> this normal?
Well, Nutch (resp. Hadoop) is designed to process much data. Job management
has some overhead (and some artificial sleeps): 5 cycles * 4 jobs
(generate/fetch/parse/update) = 20 jobs. 6s per job seems roughly ok, though
it could be slightly faster.

Sebastian

On 10/18/2012 05:55 PM, Luca Vasarelli wrote:
> Hello,
> 
> I'm using Nutch 2.1 on Linux and I'm having a similar problem to
> http://goo.gl/nrDLV, my Nutch is
> fetching the same pages at each round.
> 
> I've built a simple localhost site, with 3 pages linked to each other:
> first.htm -> second.htm -> third.htm
> 
> I did these steps:
> 
> - downloaded nutch 2.1 (source) & untarred to ${TEMP_NUTCH}
> - edited ${TEMP_NUTCH}/ivy/ivy.xml uncommenting the line about mysql backend 
> (thanks to [1])
> - edited ${TEMP_NUTCH}/conf/gora.properties removing default sql 
> configuration and adding mysql
> properties (thanks to [1])
> - ran "ant runtime" from ${TEMP_NUTCH}
> - moved ${TEMP_NUTCH}/runtime/local/ to /opt/${NUTCH_HOME}
> - edited ${NUTCH_HOME}/conf/nutch-site.xml adding http.agent.name, 
> http.robots.agents and changing
> db.ignore.external.links to true and fetcher.server.delay to 0.0
> - created ${NUTCH_HOME}/urls/seed.txt with "http://localhost/test/first.htm"
> inside this file
> - created db table as [1]
> 
> Then I ran "bin/nutch crawl urls -threads 1"
> 
> first.htm was fetched 5 times
> second.htm was fetched 4 times
> third.htm was fetched 3 times
> 
> I tried doing each step separately (inject, generate, ...) with the same 
> results.
> 
> Also the whole process takes about 2 minutes, am I missing something about 
> some delay config or is
> this normal?
> 
> Some extra info:
> 
> - HTML of the pages: http://pastebin.com/dyDPJeZs
> - Hadoop log: http://pastebin.com/rwQQPnkE
> - nutch-site.xml: http://pastebin.com/0WArkvh5
> - Wireshark log: http://pastebin.com/g4Bg17Ls
> - MySQL table: http://pastebin.com/gD2SvGsy
> 
> [1] http://nlp.solutions.asia/?p=180



Same pages crawled more than once and slow crawling

2012-10-18 Thread Luca Vasarelli

Hello,

I'm using Nutch 2.1 on Linux and I'm having a similar problem to
http://goo.gl/nrDLV, my Nutch is fetching the same pages at each round.


I've built a simple localhost site, with 3 pages linked to each other:
first.htm -> second.htm -> third.htm

I did these steps:

- downloaded nutch 2.1 (source) & untarred to ${TEMP_NUTCH}
- edited ${TEMP_NUTCH}/ivy/ivy.xml uncommenting the line about mysql 
backend (thanks to [1])
- edited ${TEMP_NUTCH}/conf/gora.properties removing default sql 
configuration and adding mysql properties (thanks to [1])

- ran "ant runtime" from ${TEMP_NUTCH}
- moved ${TEMP_NUTCH}/runtime/local/ to /opt/${NUTCH_HOME}
- edited ${NUTCH_HOME}/conf/nutch-site.xml adding http.agent.name, 
http.robots.agents and changing db.ignore.external.links to true and 
fetcher.server.delay to 0.0 (see the sketch after this list)
- created ${NUTCH_HOME}/urls/seed.txt with
"http://localhost/test/first.htm" inside this file

- created db table as [1]
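Roughly, the nutch-site.xml additions from the list above might look like the sketch below; the agent name shown is just a made-up placeholder, not the one I actually used:

<property>
  <name>http.agent.name</name>
  <!-- placeholder crawler name -->
  <value>my-test-crawler</value>
</property>
<property>
  <name>http.robots.agents</name>
  <!-- placeholder: agent strings checked against robots.txt -->
  <value>my-test-crawler,*</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>0.0</value>
</property>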

Then I ran "bin/nutch crawl urls -threads 1"

first.htm was fetched 5 times
second.htm was fetched 4 times
third.htm was fetched 3 times

I tried doing each step separately (inject, generate, ...) with the same 
results.


Also the whole process takes about 2 minutes, am I missing something 
about some delay config or is this normal?


Some extra info:

- HTML of the pages: http://pastebin.com/dyDPJeZs
- Hadoop log: http://pastebin.com/rwQQPnkE
- nutch-site.xml: http://pastebin.com/0WArkvh5
- Wireshark log: http://pastebin.com/g4Bg17Ls
- MySQL table: http://pastebin.com/gD2SvGsy

[1] http://nlp.solutions.asia/?p=180


building from src

2012-10-18 Thread sumarlidason
Good Morning,

I am working on building nutch from source on centos to be used in
conjunction with solr and hadoop.

So far I have...
downloaded the source, ( http://www.gtlib.gatech.edu/pub/apache/nutch/2.1/ )
built with ant, successfully,
created a bin folder,
downloaded the nutch script, (
https://svn.apache.org/repos/asf/nutch/branches/branch-1.2/bin/nutch )
set three environment variables:
JAVA_HOME=/usr/java/jdk1.6.0_26/
NUTCH_HOME=/root/Downloads/apache-nutch-2.1/
NUTCH_JAVA_HOME=/usr/java/jdk1.6.0_26/

When attempting to run, I get the following error:

[root@hdpjt01 apache-nutch-2.1]# bin/nutch crawl urls -dir crawl -depth 3
-topN 5
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/util/PlatformName
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.util.PlatformName
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.hadoop.util.PlatformName.  Program
will exit.
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/nutch/crawl/Crawl
Caused by: java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.nutch.crawl.Crawl.  Program will
exit.

As I drafted this post, I see a possible problem: the script I'm running is
from a 1.2 branch. Where can I get a script for 2.1? When I attempt to run
the jar directly I get other errors. I read somewhere about merging jars so
it could be run stand-alone?

Please, if someone can assist...





Re: Fetcher Thread

2012-10-18 Thread Ye T Thet
Thanks Markus.

I will remember to set 1 for threads.per.host.

Cheers,

Ye

On Thu, Oct 18, 2012 at 9:55 PM, Markus Jelsma
wrote:

> Hi Ye,
>
> -Original message-
> > From:Ye T Thet 
> > Sent: Thu 18-Oct-2012 15:46
> > To: user@nutch.apache.org
> > Subject: Fetcher Thread
> >
> > Hi Folks,
> >
> > I have two questions about the Fetcher Thread in Nutch. The value
> > fetcher.threads.fetch in the configuration file determines the number of
> > threads Nutch would use to fetch. Of course threads.per.host is also
> > used for politeness.
> >
> > I set 100 for fetcher.threads.fetch and 2 for the threads.per.host value.
> > So far in my development I have been using only one linux box to fetch,
> > thus it is clear that Nutch would fetch 100 urls at a time provided that
> > the threads.per.host criterion is met.
> >
> > The questions are:
> >
> > 1. What if I crawl on a hadoop cluster with 5 linux boxes and set
> > fetcher.threads.fetch to 100? Would Nutch fetch 100 urls at a time or
> > 500 (5 x 100) at a time?
>
> All nodes are isolated and don't know what the other is doing. So if you
> set the threads to 100 for each machine, each machine will run with 100
> threads.
>
> >
> > 2. Any advice on formulating the optimum fetcher.threads.fetch and
> > threads.per.host for a hadoop cluster with 5 linux boxes (Amazon EC2
> > medium instances, 3.7 GB memory)? I would be crawling around 10,000 (10k)
> > web sites.
>
> I think threads per host must not exceed 1 for most websites, out of
> politeness. You can set the number of threads as high as you like; it only
> takes more memory. If you parse in the fetcher as well, you can run far
> fewer threads.
>
> >
> > Thanks,
> >
> > Ye
> >
>


RE: Fetcher Thread

2012-10-18 Thread Markus Jelsma
Hi Ye,
 
-Original message-
> From:Ye T Thet 
> Sent: Thu 18-Oct-2012 15:46
> To: user@nutch.apache.org
> Subject: Fetcher Thread
> 
> Hi Folks,
> 
> I have two questions about the Fetcher Thread in Nutch. The value
> fetcher.threads.fetch in the configuration file determines the number of
> threads Nutch would use to fetch. Of course threads.per.host is also
> used for politeness.
> 
> I set 100 for fetcher.threads.fetch and 2 for the threads.per.host value. So
> far in my development I have been using only one linux box to fetch, thus it
> is clear that Nutch would fetch 100 urls at a time provided that the
> threads.per.host criterion is met.
> 
> The questions are:
> 
> 1. What if I crawl on a hadoop cluster with 5 linux boxes and set
> fetcher.threads.fetch to 100? Would Nutch fetch 100 urls at a time or 500
> (5 x 100) at a time?

All nodes are isolated and don't know what the other is doing. So if you set 
the threads to 100 for each machine, each machine will run with 100 threads.

> 
> 2. Any advice on formulating the optimum fetcher.threads.fetch and
> threads.per.host for a hadoop cluster with 5 linux boxes (Amazon EC2 medium
> instances, 3.7 GB memory)? I would be crawling around 10,000 (10k) web
> sites.

I think threads per host must not exceed 1 for most websites, out of politeness.
You can set the number of threads as high as you like; it only takes more
memory. If you parse in the fetcher as well, you can run far fewer threads.
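Concretely, and only as a hedged sketch (the per-host property is fetcher.threads.per.host in many Nutch versions; check nutch-default.xml for the exact name in yours), that combination would look something like this in nutch-site.xml:

<property>
  <name>fetcher.threads.fetch</name>
  <!-- illustrative: lots of threads overall; memory is the main cost -->
  <value>100</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <!-- keep at 1 for politeness toward hosts you do not control -->
  <value>1</value>
</property>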

> 
> Thanks,
> 
> Ye
>