RE: How long to get 100 million page

2006-08-23 Thread Dan Morrill
Hi, I found that with a 3 meg DSL line I was averaging 8 pages per second with a similar setup; at that rate, reaching 100 million pages would take about 145 days: 100,000,000 pages / 8 pages per second / 60 seconds per minute / 60 minutes per hour / 24 hours per day ≈ 144.7 days. Just an FYI rule of thumb on a Qwest DSL lin
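
The rule of thumb in this message can be sketched as a small calculation. This is only an illustration of the arithmetic quoted in the post; the function name `crawl_days` is hypothetical and is not part of Nutch, and it assumes the fetch rate stays constant for the whole crawl.

```python
def crawl_days(total_pages: int, pages_per_second: float) -> float:
    """Estimate crawl duration in days at a constant fetch rate."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    seconds = total_pages / pages_per_second
    # seconds -> minutes -> hours -> days, as in the message
    return seconds / 60 / 60 / 24


if __name__ == "__main__":
    # Figures from the message: 100 million pages at 8 pages per second.
    print(f"{crawl_days(100_000_000, 8):.1f} days")
```

Doubling the sustained rate halves the estimate, so the same function gives a quick feel for what a faster line would buy.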

RE: Nutch on Windows

2006-07-14 Thread Dan Morrill
Kerry, I run Nutch entirely on Windows; use Cygwin. If you have other questions give me a shout. r/d -Original Message- From: Kerry Wilson [mailto:[EMAIL PROTECTED] Sent: Friday, July 14, 2006 11:50 AM To: nutch-user@lucene.apache.org Subject: Nutch on Windows Trying to use nutch

RE: Re[2]: Image Search

2006-06-03 Thread Dan Morrill
as On 6/3/06, Dan Morrill <[EMAIL PROTECTED]> wrote: > Well I can do the project management side of it, and can volunteer some > time, but have never done this in an open source model before. But I can do > documentation, project management support, and make a decent cheer leader a

RE: Re[2]: Image Search

2006-06-03 Thread Dan Morrill
nent? Now, the question is how we can support several people working on this from a "project management" or code management perspective? I mean, if we want the Sandbox to flourish, we need some kind of infrastructure, right? Rgrds, Thomas Delnoij On 6/3/06, Dan Morrill <[

RE: Re[2]: Image Search

2006-06-03 Thread Dan Morrill
Sounds like everyone, even me, is interested in being able to provide this service. If the process requires that we break it off from the Nutch code, what would be required to make this happen? r/d -Original Message- From: Zaheed Haque [mailto:[EMAIL PROTECTED] Sent: Saturday, June 03, 2

RE: Admin Gui beta test (was Re: ATB: Heritrix)

2006-04-28 Thread Dan Morrill
> >> -Neges Wreiddiol-/-Original Message- >> Oddi wrth/From: Dan Morrill [mailto:[EMAIL PROTECTED] >> Anfonwyd/Sent: 28 April 2006 14:07 >> At/To: nutch-user@lucene.apache.org >> Pwnc/Subject: RE: Heritrix >> >> Aled, >> >> I

RE: Heritrix

2006-04-28 Thread Dan Morrill
: Heritrix Thanks for your replies guys. I hadn't realised that the admin gui was already in development. We should be able to cope till it gets released ;-) Thanks again Aled > -Neges Wreiddiol-/-Original Message- > Oddi wrth/From: Dan Morrill [mailto:[EMAI

RE: Heritrix

2006-04-28 Thread Dan Morrill
Aled, I used Heritrix before moving to Nutch. While it is an excellent program with lots of good things to offer, it didn't quite meet my needs, and when I was designing the architecture it had too many dependencies for me to be comfortable with. If you want to run an internet archive though, heritri

RE: yes, a European nutch meeting is also planed :)

2006-04-22 Thread Dan Morrill
How about one in Seattle? r/d -Original Message- From: David Webster [mailto:[EMAIL PROTECTED] Sent: Saturday, April 22, 2006 12:29 AM To: nutch-user@lucene.apache.org Subject: Re: yes, a European nutch meeting is also planed :) On Sat, 22 Apr 2006 09:33:59 +0530, Arun Kaundal wrote:

Question about Nutch needs for crawl data

2006-04-10 Thread Dan Morrill
This is for the Nutch crew: I was reading in the paper this morning that you (the Nutch group) were looking to build a 1 billion URL database, and while I only have some 10 million URLs, I will happily share my crawl if you still want to do this project. Admittedly it is just 1% of what yo

Nutch search and hard drive hot spots

2006-04-09 Thread Dan Morrill
Just want to ask if anyone else has noticed that the index and segments under the searcher dir are causing a hot spot on the hard drive under heavy search transaction load. I am on Windows, Nutch 7.1, Tomcat 5.15, and have tuned the system for some decent performance, modified both Tomcat and Nu

RE: Crawl status

2006-04-05 Thread Dan Morrill
Fabrice, Personally I am tailing the crawl log to find that out. About every 100 pages it prints the total number of pages, the pages per second, and the line speed. Hope that helps. r/d -Original Message- From: Fabrice Estiévenart [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 05,

RE: Merging indexes -- please help....

2006-04-03 Thread Dan Morrill
Hi, I noticed that it didn't like it when I used the drive designation (Windows Cygwin environment). If you run ./nutch merge -local /STG1/index /STG1/indexes, that may work better; let me know. Cheers/r/dan -Original Message- From: Vertical Search [mailto:[EMAIL PROTECTED] Sent:

RE: hi all

2006-04-02 Thread Dan Morrill
Andrzej, Cheers! Good to know. Thanks! r/d -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] Sent: Sunday, April 02, 2006 5:01 PM To: nutch-user@lucene.apache.org Subject: Re: hi all Dan Morrill wrote: > Since you are using Luke to see the index, luke may not have

RE: Problems Installing

2006-04-02 Thread Dan Morrill
Did you: 1. remove the root.war from Tomcat? 2. rename nutch.war to root.war and drop that into webapps under Tomcat? 3. check that it installed OK (can you see the exploded pages under webapps root)? Just checking, this is how I fixed the same issue under Windows. r/d -Original Message- From: P

RE: hi all

2006-04-02 Thread Dan Morrill
Subject: Re: hi all Thanks for the advice! Now I know what's up. But my OS is WinXP (Chinese), which supports Chinese very well, and when I use Luke to look at the index there are garbled characters after crawling Chinese web sites. So how can I deal with it? Any reply will be appreciated. On 4/2/06

RE: hi all

2006-04-02 Thread Dan Morrill
Good Morning Kauu, I have noticed that Nutch only knows about UTF-8 character codes, so simplified Chinese pages served as UTF-8 should come out OK. If the crawl sees Chinese in a non-UTF-8 encoding, the web site may be serving the pages under an older ISO standard, or you may not have the language pac
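
The kind of garbling discussed in this thread can be reproduced outside Nutch. The sketch below is an assumption-laden illustration, not Nutch's own charset handling: `decode_page` is a hypothetical helper, and the fallback order (declared charset, then UTF-8, then GB18030) is just one plausible choice for simplified Chinese pages.

```python
from typing import Optional


def decode_page(raw: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode page bytes, trying the declared charset, then UTF-8, then GB18030."""
    for encoding in (declared_charset, "utf-8", "gb18030"):
        if encoding is None:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: fall through to the next candidate.
            continue
    # Last resort: keep what can be read and mark the rest as U+FFFD.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # A page served in a legacy simplified Chinese encoding rather than UTF-8.
    raw = "中文网页".encode("gb2312")
    # Treating those bytes as UTF-8 produces replacement characters.
    print(raw.decode("utf-8", errors="replace"))
    # Falling back to GB18030 (a superset of GB2312) recovers the text.
    print(decode_page(raw))
```

If an index viewer shows replacement characters or mismatched symbols for such pages, the bytes were most likely decoded with the wrong charset somewhere between fetch and index.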

RE: Multiple crawls how to get them to work together

2006-04-02 Thread Dan Morrill
Do you have that shell script? On 3/30/06, Dan Morrill <[EMAIL PROTECTED]> wrote: > Hi folks, > > It worked, it worked great, I made a shell script to do the work for me. > Thank you, thank you, and again, thank you. > > r/d > > -Original Message- > From: Dan

Got it up and running

2006-03-30 Thread Dan Morrill
Thanks folks, I ran a field test today, and just slotted in the new index on the test server. The front end is not customized yet, but the help has been excellent, and I just want to show off the work. It's at http://71.35.163.79/index.jsp if you are interested. The

RE: Multiple crawls how to get them to work together

2006-03-30 Thread Dan Morrill
Hi folks, It worked, it worked great, I made a shell script to do the work for me. Thank you, thank you, and again, thank you. r/d -Original Message- From: Dan Morrill [mailto:[EMAIL PROTECTED] Sent: Thursday, March 30, 2006 5:12 AM To: nutch-user@lucene.apache.org Subject: RE

RE: Legal issues

2006-03-30 Thread Dan Morrill
If I remember correctly, Google has been sued and has won a number of times on this issue. You can cache, and you can search others' web sites. Groklaw has the data on this one, but I know you can search and you can cache under fair use and the idea of public access, as long as you are not cracking passwor

RE: Multiple crawls how to get them to work together

2006-03-30 Thread Dan Morrill
folder is only used when fetching). Now just set the searcher.dir property value in nutch-site.xml to be the location of search.dir That's how I've been doing it, although it may not be the "right" way. :-) Hope this helps. Cheers Aled > -Neges Wreiddiol-----/-Ori

Multiple crawls how to get them to work together

2006-03-29 Thread Dan Morrill
Hi folks, I have 3 crawls: crawlA, crawlB, and crawlC. I would like all of them to be available to the search.jsp page. I went through the site, saw merge, index, and make new db, and followed all the directions that I could find, but still no resolution on this one. So what I need are some ide