Hello!
Monday, February 28, 2005, 1:53:07 AM, you wrote:
Y> 2. How to get the number of pages in DB?
try "bin/nutch readdb dbpath -stats"
--
Best regards, mailto:[EMAIL PROTECTED]
Chernodarov Egor
--
Hi Paul,
the general list would indeed be more appropriate, but
let's finish the thread here. That makes it easier for everyone.
It sounds like you have not built your index yet. I
recommend following the tutorial steps closely if you are a
first-time user. After you have built the index, you
It turns out that Apache won't import our CVS history into SVN. So,
unless someone argues against it, tomorrow I will:
- disable commits to Nutch's Sourceforge CVS repository
- rename all packages from net.nutch to org.apache.nutch
- change the license to Apache
- commit Nutch's sources to A
Hi Olaf,
Thanks for the reply. I am trying the same crawl with depth 5, and it
has been going on for about 3 hours. So it's a good sign.
I read on the Nutch website somewhere that even if the crawl is
not finished, you can use the results from whatever it has. I tried
doing just that, and all
Hi Shri,
if you look at regex-normalize.xml you will notice
that jsessionid parameters are not stripped out by default.
Your session id seems to be quite long. If you are
sure of the format, you may try inserting the following
rule
(;)jsessionid=[a-zA-Z0-9!-]{64}(\?)
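As a sketch only, a simplified version of that rule in
regex-normalize.xml would look roughly like this (the character class
and the length of 64 are assumptions about your id format, so adjust
them to the ids you actually see):

  <regex>
    <!-- drop the ;jsessionid=... parameter entirely (pattern is a guess) -->
    <pattern>;jsessionid=[a-zA-Z0-9!-]{64}</pattern>
    <substitution></substitution>
  </regex>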
Are all links of the same style or
Hi Will,
I agree with you that this is an intriguing idea, but
1. Look at http://www.eurekster.com/
2. Privacy Issues (see gmail)
3. Too many sites
If you nevertheless feel inclined to program
this, go ahead. I will be one of the first to test your new
interface and give feedback.
Kind
Hi Paul,
the most likely problem seems to me to be the depth of 15.
If your first page and every page after it had 10 links each,
your crawler would have to fetch roughly 24414062500
gigabytes from the Internet.
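(Rough arithmetic, assuming something like 25 KB per page: 10 links
per page over 15 levels is on the order of 10^15 pages, and
10^15 pages x 25 KB/page is roughly 2.5 x 10^19 bytes, i.e. tens of
billions of gigabytes.)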
Depending on your data, start with a much smaller depth.
Kind regards,
Olaf
On Mon, 28 Feb
Sorry for the inconvenience. I just fixed this bug in CVS, although it
may take a few hours to reach the public CVS. Look for version 1.7 at:
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/io/WritableComparator.java
Sorry. My bad.
Doug
-
Bugs item #1152281, was opened at 2005-02-26 02:32
Message generated for change (Comment added) made by cutting
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1152281&group_id=59548
Category: None
Group: None
>Status: Closed
>Resolution: Fixed
Priority:
Hello All,
I was running an intranet crawl and it seems like it did not finish properly.
It is a pretty default setup, but the crawl's depth was 15, and I had
turned on query URLs by commenting out
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
other than bunch o
Matthias Jaekle wrote:
Thanks Andrzej!
Matthias
Andrzej Bialecki wrote:
Matthias Jaekle wrote:
Hi Andrzej,
I have copied the 2 segments to: http://www.iventax.de/NEW/
You can wget -r them. Both of them are around 2 GB.
If you would like me to tar them up or package them some other way, please let me know.
Thanks for your
Hi,
I tried to index a segment with around 5,000,000 URLs.
This is running at 0.31 records/s.
That seems to be very slow.
When I was running indexing on smaller segments (500,000 docs) it seemed
to be much faster.
Are there any explanations?
I use a 1.4 GHz Athlon with 1 GB RAM.
What is a good segment