Re: Hardware Planning

2007-11-29 Thread VK
Have you considered EC2 + S3? Also, Rightscale has some interesting solutions, which I am currently evaluating. On Nov 28, 2007 9:38 PM, Paul Stewart [EMAIL PROTECTED] wrote: Hi folks... I have read the archives and am looking for input specific to my estimated requirements: Want to index

Re: can't find hadoop classes necessary to use Nutch API

2007-11-29 Thread Sami Siren
Ana Rodighiero wrote: I have Nutch running on my server and it crawls and searches just fine. I am writing a Java program to use the search API, but cannot compile it because I am missing some classes from Hadoop. Are these classes included somewhere in the Nutch or Tomcat downloads? If not, how

Re: very poor fetch performance with nutch .8

2007-11-29 Thread Josh Attenberg
I am using the same settings I used with nutch .7. Are there any settings I could tweak to optimize my performance? On Nov 28, 2007 2:50 PM, Josh Attenberg [EMAIL PROTECTED] wrote: Previously, I had been using nutch .7. When crawling I have achieved very good rates, ~100 pages per second,

RE: Hardware Planning

2007-11-29 Thread Ken Krugler
Hi Paul, Leaving aside the hardware requirements for the crawl... The main issue with what you need to achieve is the nature of your index. If you're using the results of a standard Nutch web crawl, then search times < 500ms shouldn't be a problem. But you actually want something more

Basic question about indexing

2007-11-29 Thread Venkat Korvi
Hi folks, I would deeply appreciate it if someone could shed light on how to solve a specific search I am trying to accomplish with Nutch. I am currently ABLE to do the following: Use Nutch to crawl a directory in the local filesystem (Linux) (The local directory has HTML files) When I run

Merge indexes using nutch v 0.9

2007-11-29 Thread Cool Coder
I have generated lots of indexes for individual sites using Nutch and was looking for a way to merge all of them into one index to be used in a live system. I was really struggling to merge them all and finally I was able to find a way. Here are the steps. Let's say you have two working

nutch programmer needed for custom scoring plugin

2007-11-29 Thread ronjonbb
Hi - I'm looking to hire a programmer to build custom opicscoringfilter plugin(s) for both index time and query time (boost documents flagged important or not important, from a MySQL table). I've done a lot of research and it seems to be easy, but I'm not much of a Java programmer. Or if it's

RE: Hardware Planning

2007-11-29 Thread Paul Stewart
Thanks very much for the details... I appreciate it... I'd be happy with the 500ms range on *average* but totally understand your point about searches piling up. So you're suggesting about 20 million pages per box - each box with 4 drives, dual CPU and 4 gig RAM? I guess what I don't totally

maintainability of nutch - building incremental index

2007-11-29 Thread Koe Black
Hello All, I have a question. Our scenario is: 1. Crawl the db initially and set the index to run the application 2. Add new URLs to webdb (not a problem) 3. We want to crawl just the new websites (from step 2) and add the crawled results to the db and index created initially - we want to do this

Fetching site's sub-folders only

2007-11-29 Thread peashey
Hi, We want to fetch all sub-folders of a site, excluding outside links and pages that are on upper levels. Is there a way to set up Nutch to work in this mode, without using the RegExp filter file (regex-urlfilter.txt)? Thank you.
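
For readers following this thread, the restriction being asked about (same site, at or below a starting folder, no regular expressions) comes down to a plain prefix test on the URL. The sketch below shows only that logic in standalone Python. It is not Nutch code; the function name `in_subtree` and the example.com URLs are hypothetical and used purely for illustration.

```python
from urllib.parse import urlsplit


def in_subtree(url: str, base: str) -> bool:
    """Return True when url is on the same scheme and host as base
    and its path is at or below base's path (hypothetical helper)."""
    u, b = urlsplit(url), urlsplit(base)
    # Reject outside links: scheme and host must match the starting point.
    if (u.scheme, u.netloc.lower()) != (b.scheme, b.netloc.lower()):
        return False
    # Compare against the base path with a trailing slash so that
    # "/docs-old/x" is not mistaken for a child of "/docs".
    base_path = b.path if b.path.endswith("/") else b.path + "/"
    return u.path == b.path or u.path.startswith(base_path)


if __name__ == "__main__":
    start = "http://example.com/docs/"
    for candidate in (
        "http://example.com/docs/guide/intro.html",  # sub-folder: kept
        "http://example.com/index.html",             # upper level: dropped
        "http://other.example.org/docs/page.html",   # outside link: dropped
    ):
        print(candidate, in_subtree(candidate, start))
```

A check of this shape keeps pages under the starting folder and drops both upper-level pages and links to other hosts. Whether a given Nutch release offers a prefix-based filter that can be configured this way should be confirmed against that release's plugin list.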