Hi Manoharam,
You can use the parse command to parse a segment after it has been fetched with
the -noParsing option. The result will be equivalent to running fetch without
the -noParsing option.
In your Nutch installation directory, try the command bin/nutch. It will
give you the usage for the parse command.
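A minimal sketch of that fetch-then-parse flow (the segment path here is hypothetical):

  seg=crawl/segments/20070531123456    # hypothetical segment directory
  bin/nutch fetch $seg -threads 50 -noParsing   # fetch without parsing
  bin/nutch parse $seg                          # parse the fetched segment afterwards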
I am crawling pages using the following commands in a loop iterating 10 times:
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
seg1=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $seg1 -threads 50
bin/nutch updatedb crawl/crawldb $seg1
I am getting the following
Thanks.
I do my crawl using the Intranet Recrawl script available in the wiki.
I have put these statements in a loop iterating 10 times (see the sketch after the list).
1. bin/nutch generate crawl/crawldb crawl/segments -topN 1000
2. seg1=`ls -d crawl/segments/* | tail -1`
3. bin/nutch fetch $seg1 -threads 50
4. bin/nutch updatedb crawl/crawldb $seg1
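A minimal sketch of that loop as one shell script, assuming it runs from the Nutch installation directory:

  #!/bin/sh
  # repeat the generate / fetch / updatedb cycle ten times
  for i in 1 2 3 4 5 6 7 8 9 10; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    seg1=`ls -d crawl/segments/* | tail -1`   # pick the newest segment
    bin/nutch fetch $seg1 -threads 50
    bin/nutch updatedb crawl/crawldb $seg1
  done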
Some confusion regarding plugin.includes
1. I find a parse-oo plugin in the plugins folder. What is that for?
2. I have enabled parse-pdf by including it in plugin.includes in
nutch-site.xml. The pages now come up in the search results. But when I
visit the cached page of a result, it shows a message like
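For what it's worth, enabling parse-pdf normally means adding it to the plugin.includes value in conf/nutch-site.xml. A sketch (the rest of the value shown is only an illustrative default and may not match your installation):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  </property>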
I want to use two filters: one for crawling and another for searching
through search.jsp.
I am currently using regex-urlfilter.txt for the generate, fetch, update
cycle. But when a user searches the sites, I want certain sites that have
been crawled not to appear in the results.
How can this be done?
On 5/18/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Doğacan Güney wrote:
Hi everyone,
Has anyone tried Fetcher2 from the latest trunk? In our tests, Fetcher2 is
always slower (by a large margin) than Fetcher.
- Original Message -
From: Andrzej Bialecki [EMAIL PROTECTED]
Sent: Thursday, May 31, 2007 2:25 PM
Are you running jobs in the local mode? In distributed mode filtering is
naturally parallel, because you have as many concurrent lookups as there
are map tasks.
I'm just using the vanilla (local) configuration.
Maybe I didn't, or maybe I did :-) If you want to filter out certain
sites, what then is the problem with doing it in the query?
My query was wrong, however; adding '-site:' in front of every site
you want to exclude will do the trick. I am sure there is a
smarter way as well.
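For example, a query like this (the host names are hypothetical) would keep two crawled sites out of the results:

  budget report -site:private.example.com -site:test.example.com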
I do not
We set up an /etc/resolv.conf configuration as shown below. This allows
us to check locally first, then two of the major DNS caches on the Internet,
before requesting it through a local DNS caching server. The 208
addresses are OpenDNS servers and the 4.x addresses are Verizon DNS
servers. All
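The file itself was cut from this excerpt; a plausible reconstruction from the description, trimmed to glibc's three-nameserver limit (the exact addresses are assumptions, though 208.67.x.x is OpenDNS and 4.2.2.x is Verizon):

  # /etc/resolv.conf
  # "local first" lookups come from /etc/hosts via nsswitch.conf
  nameserver 208.67.222.222   # OpenDNS cache
  nameserver 4.2.2.2          # Verizon cache
  nameserver 127.0.0.1        # local caching DNS server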
Doğacan Güney wrote:
I am still not sure about the source of this bug, but I think I found
some unnecessary waits in Fetcher2. Even if a url is blocked by
robots.txt (or has a crawl delay larger than max.crawl.delay),
Fetcher2 still waits fetcher.server.delay before fetching another url
from
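For context, both delays are ordinary Nutch configuration properties. A sketch of the relevant conf/nutch-site.xml entries (values illustrative; in shipped configurations the second property appears as fetcher.max.crawl.delay):

  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>   <!-- seconds between requests to the same server -->
  </property>
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>30</value>    <!-- skip pages whose robots.txt Crawl-Delay exceeds this -->
  </property>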
Enzo Michelangeli wrote:
I'm just using the vanilla (local) configuration. The situation is so
bad that lately I'm seeing durations like:
generate: 2h 48' (-topN 2)
fetch:    1h 40' (200 threads)
updatedb: 2h 20'
This is because both generate and updatedb perform filtering, and are
Hi,
On 5/31/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
Some confusion regarding plugin.includes
1. I find a parse-oo in the plugins folder. What is that for?
Plugin parse-oo has something to do with parsing OpenOffice.org
documents, though I am not sure exactly what.
2. I have enabled
- Original Message -
From: Andrzej Bialecki [EMAIL PROTECTED]
Sent: Thursday, May 31, 2007 11:39 PM
Caching seems to be the only solution. Even if you were able to fire DNS
requests more rapidly, remote servers wouldn't be able (or wouldn't like
to) respond that quickly ...
Then why
Doğacan Güney wrote:
Good catch! The patch looks good, too - please go ahead. One question:
why did you remove the call to finishFetchItem() around line 505?
Because it seems we already call finishFetchItem in that code path
just before the switch statement. I have opened NUTCH-495 for this,
After some tweaking and playing I've found a couple of things.
I moved the mapred.map.tasks and the mapred.reduce.tasks configuration
settings into hadoop-site.xml from a mapred.xml file (maybe
mapred-default.xml, I forget now). Also, my crawl only seems to work
when using the value of 20 for
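A sketch of what those relocated settings might look like in conf/hadoop-site.xml; the message does not say which setting the value 20 applied to, so both values below are only placeholders:

  <property>
    <name>mapred.map.tasks</name>
    <value>20</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>20</value>
  </property>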