Hi,
I found that with a 3 meg DSL line I was averaging 8 pages per second with a
similar setup. At that rate, reaching 100 million pages would take about 144 days:
100,000,000 pages / 8 pages per second / 60 seconds per minute / 60 minutes per
hour / 24 hours per day.
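As a rough check of that arithmetic, here is a minimal Python sketch. The function name crawl_days is only illustrative, and it assumes the fetch rate stays constant for the whole crawl, which a real crawl will not guarantee.

```python
def crawl_days(total_pages, pages_per_second):
    """Estimate crawl duration in days at a constant fetch rate."""
    seconds = total_pages / pages_per_second
    # seconds -> minutes -> hours -> days
    return seconds / 60 / 60 / 24


# Figures from the message above: 100 million pages at 8 pages per second.
days = crawl_days(100_000_000, 8)
print(f"{days:.1f} days")  # prints "144.7 days"
```

Halving the rate doubles the estimate, so the result is only as good as the measured pages-per-second figure.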
Just an FYI rule of thumb on a Qwest DSL lin
Kerry,
I run Nutch entirely on Windows, using Cygwin. If you have other
questions, give me a shout. r/d
-Original Message-
From: Kerry Wilson [mailto:[EMAIL PROTECTED]
Sent: Friday, July 14, 2006 11:50 AM
To: nutch-user@lucene.apache.org
Subject: Nutch on Windows
Trying to use nutch
as
On 6/3/06, Dan Morrill <[EMAIL PROTECTED]> wrote:
> Well I can do the project management side of it, and can volunteer some
> time, but have never done this in an open source model before. But I can do
> documentation, project management support, and make a decent cheer leader a
nent?
Now, the question is how we can support several people working on this
from a "project management" or code management perspective?
I mean, if we want the Sandbox to flourish, we need some kind of
infrastructure, right?
Rgrds, Thomas Delnoij
On 6/3/06, Dan Morrill <[
Sounds like everyone, even me, is interested in being able to provide this
service.
If the process requires that we break it off from the Nutch code, what would
be required to make this happen?
r/d
-Original Message-
From: Zaheed Haque [mailto:[EMAIL PROTECTED]
Sent: Saturday, June 03, 2
>
>> -Neges Wreiddiol-/-Original Message-
>> Oddi wrth/From: Dan Morrill [mailto:[EMAIL PROTECTED]
>> Anfonwyd/Sent: 28 April 2006 14:07
>> At/To: nutch-user@lucene.apache.org
>> Pwnc/Subject: RE: Heritrix
>>
>> Aled,
>>
>> I
: Heritrix
Thanks for your replies guys. I hadn't realised that the admin gui was
already in development.
We should be able to cope till it gets released ;-)
Thanks again
Aled
> -Neges Wreiddiol-/-Original Message-
> Oddi wrth/From: Dan Morrill [mailto:[EMAI
Aled,
I used Heritrix before going over to Nutch. While it is an excellent
program with lots of good things to offer, it didn't quite meet my needs,
and when I was designing the architecture it had too many dependencies for me
to be comfortable with.
If you want to run an internet archive though, heritri
How about one in Seattle?
r/d
-Original Message-
From: David Webster [mailto:[EMAIL PROTECTED]
Sent: Saturday, April 22, 2006 12:29 AM
To: nutch-user@lucene.apache.org
Subject: Re: yes, a European nutch meeting is also planed :)
On Sat, 22 Apr 2006 09:33:59 +0530, Arun Kaundal wrote:
This is for the Nutch crew,
I was reading in the paper this morning that you (the Nutch group) were
looking to build a 1 billion URL database, and while I only have some 10
million URLs, I will happily share my crawl if you still want to do
this project. Admittedly, it is just 1% of what yo
Just want to ask if anyone else has noticed that the index and segments
under the searcher dir are causing a hot spot on the hard drive under heavy
search transaction load.
I am on Windows, Nutch 7.1, Tomcat 5.15, and have tuned the system for some
decent performance. Modified both tomcat and Nu
Fabrice,
Personally, I am tailing the crawl log to find that out. About every 100
pages it reports the total number of pages, the pages per second, and the
line speed.
Hope that helps.
r/d
-Original Message-
From: Fabrice Estiévenart [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 05,
Hi,
I noticed that it didn't like it when I used the drive designation
(Windows Cygwin environment). If you run
./nutch merge -local /STG1/index /STG1/indexes
instead, that may work better. Let me know.
Cheers/r/dan
-Original Message-
From: Vertical Search [mailto:[EMAIL PROTECTED]
Sent:
Andrzej,
Cheers! Good to know. Thanks!
r/d
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 02, 2006 5:01 PM
To: nutch-user@lucene.apache.org
Subject: Re: hi all
Dan Morrill wrote:
> Since you are using Luke to see the index, luke may not have
Did you:
1. Remove the root.war from Tomcat?
2. Rename nutch.war to root.war and drop that into webapps under Tomcat?
3. Check that it installed OK (can you see the exploded pages under webapps root)?
Just checking, this is how I fixed the same issue under windows.
r/d
-Original Message-
From: P
Subject: Re: hi all
Thanks for the advice!
Now I know what's up.
But my OS is WinXP (Chinese), which supports Chinese very well, and when I use
Luke to view the index, there are garbled characters from crawling Chinese
web sites.
So how can I deal with it?
Any reply will be appreciated.
On 4/2/06
Good Morning Kauu,
I have noticed that Nutch only knows about UTF-8 character codes, so
simplified Chinese served as UTF-8 should come out OK. If the
crawl sees Chinese in a non-UTF-8 encoding, the web site may be serving the pages under an
older ISO standard, or you may not have the language pac
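To illustrate the kind of garbling an encoding mismatch produces, here is a minimal Python sketch. It says nothing about Nutch internals; the sample text and the choice of GB2312 as the page's real encoding are assumptions made purely for illustration.

```python
# Hypothetical sample: a short simplified Chinese string ("Chinese web page").
text = "中文网页"

# Assume the web site actually serves the page as GB2312 bytes.
raw = text.encode("gb2312")

# Reading those bytes as if they were UTF-8 yields replacement characters.
garbled = raw.decode("utf-8", errors="replace")

# Reading them with the encoding the site really used restores the text.
restored = raw.decode("gb2312")

print(garbled)
print(restored)
```

If the garbled form is what shows up in the index, the likely fix is on the decoding side (detecting or declaring the page's real charset) rather than in the operating system's language support.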
Do you have that shell script?
On 3/30/06, Dan Morrill <[EMAIL PROTECTED]> wrote:
> Hi folks,
>
> It worked, it worked great, I made a shell script to do the work for me.
> Thank you, thank you, and again, thank you.
>
> r/d
>
> -Original Message-
> From: Dan
Thanks folks, I ran a field test today, and just slotted in the new index on
the test server. The front end is not customized yet, but the help has been
excellent, and I just want to show off the work if anyone wants to take a
look at it.
It's at http://71.35.163.79/index.jsp if you are interested. The
Hi folks,
It worked, it worked great, I made a shell script to do the work for me.
Thank you, thank you, and again, thank you.
r/d
-Original Message-
From: Dan Morrill [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 30, 2006 5:12 AM
To: nutch-user@lucene.apache.org
Subject: RE
If I remember correctly, Google has been sued and has won a number of times on
this issue. You can cache, and you can search others' web sites. Groklaw has the
data on this one, but I know you can search and you can cache under fair use
and the idea of public access, as long as you are not cracking passwor
folder is only used when
fetching).
Now just set the searcher.dir property value in nutch-site.xml to be the
location of search.dir
That's how I've been doing it, although it may not be the "right" way.
:-) Hope this helps.
Cheers
Aled
> -Neges Wreiddiol-----/-Ori
Hi folks,
I have 3 crawls, crawlA, crawlB, and crawlC. I would like all of them to be
available to the search.jsp page.
I went through the site, saw merge, index, and make new db, and followed all the
directions that I could find, but still no resolution on this one. So what I
need are some ide