I get it now! Thanks a lot!
I was running my crawl command with fetcher.parse set to true, which was
causing the problem.
On Thu, Oct 18, 2012 at 5:53 PM, Markus Jelsma-2 [via Lucene] <
ml-node+s472066n4014609...@n3.nabble.com> wrote:
> You would have to check the generator code to make sure. But wh
Kiran,
I took a look at your nutch-site.xml and I did not see anything for
http.accept. I believe nutch-default.xml does not include application/pdf by
default in http.accept so you may need to add it in your nutch-site.xml.
Please take a look at the example below from my nutch-site.xml
h
You would have to check the generator code to make sure. But why would you want
to distribute the queue for a single domain to multiple mappers? A single local
running mapper without parsing on a low-end machine can easily fetch 20-40
records per second from the same domain (if it allows you to
Hi James,
I have increased the limit in nutch-site.xml (
https://github.com/salvager/nutch/blob/master/nutch-site.xml) and I have
created the webpage table based on the fields here (
http://nlp.solutions.asia/?p=180).
The database still shows the parseStatus as
'org.apache.nutch.parse.Pars
Thanks.. But I thought there would be a way around it..
Is it possible even to have multiple fetch lists generated (for this
problem) at all by tweaking some parameters?
[I am thinking of something like partition.url.mode - byRandom]
Hi - the generator tool partitions URLs by host, domain or IP address, so they'll
all end up in the same fetch list. Since you're crawling only one domain there is
no need to run additional mappers. If you want to crawl them as fast as you can
(and you are allowed to do that) then use only one mapper
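The partitioning behaviour described above can be illustrated with a minimal sketch. This is not Nutch's actual generator or partitioner code; the function name `partition_by_host`, the CRC32 hash, and the example URLs are assumptions made purely for illustration. It only shows why hashing on the host sends every URL of a single domain to the same fetch list, however many partitions are requested.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def partition_by_host(urls, num_partitions):
    """Group URLs into fetch lists keyed by a hash of their host name.

    Hypothetical helper for illustration only, not part of Nutch.
    """
    fetch_lists = defaultdict(list)
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(host.encode("utf-8")) % num_partitions
        fetch_lists[index].append(url)
    return dict(fetch_lists)


# Hypothetical single-domain crawl: three pages on one host.
urls = [
    "http://example.com/first.htm",
    "http://example.com/second.htm",
    "http://example.com/third.htm",
]
fetch_lists = partition_by_host(urls, num_partitions=4)
print(len(fetch_lists))  # prints 1: only one fetch list is non-empty
```

With four partitions requested, three of them stay empty, which is why extra mappers would have nothing to fetch in a single-domain crawl.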
Off topic. We are talking about an issue with the SQL backend in GORA, not
the performance of Nutch.
Julien
On 18 October 2012 20:28, Stefan Scheffler wrote:
> Hi,
> The reason Nutch is so slow is that every step runs as a Hadoop
> job, which takes a long time to start. There is also
Hi,
The reason Nutch is so slow is that every step runs as a Hadoop
job, which takes a long time to start. There is also a
hardcoded 3 second delay somewhere in the Hadoop core, which makes sense in
distributed systems but not on single machines.
Regards
stefan
Am 18.10.2012 17:55,
Hello,
I think the problem is with the storage, not Nutch itself. It looks like generate
cannot read the status or fetch time (or gets null values) from MySQL.
I had a bunch of issues with MySQL storage and switched to HBase in the end.
Alex.
-Original Message-
From: Sebastian Nagel
Hi Luca,
> I'm using Nutch 2.1 on Linux and I'm having similar problem of
> http://goo.gl/nrDLV, my Nutch is
> fetching same pages at each round.
Um... I failed to reproduce Pierre's problem with
- a simpler configuration
- HBase as back-end (Pierre and Luca both use mysql)
> Then I ran "bin
Hello,
I'm using Nutch 2.1 on Linux and I'm having a problem similar to
http://goo.gl/nrDLV: my Nutch is fetching the same pages at each round.
I've built a simple localhost site, with 3 pages linked to each other:
first.htm -> second.htm -> third.htm
I did these steps:
- downloaded nutch 2.1 (source
Good Morning,
I am working on building nutch from source on centos to be used in
conjunction with solr and hadoop.
So far I have...
downloaded the source ( http://www.gtlib.gatech.edu/pub/apache/nutch/2.1/ ),
built it with ant successfully,
created a bin folder,
downloaded the nutch script (
https://s
Thanks Markus.
I will remember to set 1 for thread.per.host.
Cheers,
Ye
On Thu, Oct 18, 2012 at 9:55 PM, Markus Jelsma
wrote:
> Hi Ye,
>
> -Original message-
> > From:Ye T Thet
> > Sent: Thu 18-Oct-2012 15:46
> > To: user@nutch.apache.org
> > Subject: Fetcher Thread
> >
> > Hi Folks,
Hi Ye,
-Original message-
> From:Ye T Thet
> Sent: Thu 18-Oct-2012 15:46
> To: user@nutch.apache.org
> Subject: Fetcher Thread
>
> Hi Folks,
>
> I have two questions about the Fetcher Thread in Nutch. The value
> fetcher.threads.fetch in configuration file determines the number of
> th