Have you considered EC2 + S3?
Also, RightScale has some interesting solutions, which I am currently
evaluating.
On Nov 28, 2007 9:38 PM, Paul Stewart [EMAIL PROTECTED] wrote:
Hi folks...
I have read the archives and am looking for input specific to my estimated
requirements:
Want to index
Ana Rodighiero wrote:
I have Nutch running on my server and it crawls and searches just fine. I am
writing a Java program to use the search API, but cannot compile because I
am missing some classes from Hadoop. Are these classes included somewhere in
the Nutch or Tomcat downloads? If not, how
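For what it's worth, the Hadoop classes ship inside the Nutch distribution's lib/ directory, so compiling against them is usually just a classpath question. A minimal sketch, assuming a Nutch 0.9-style layout (the jar names and the MySearcher class are illustrative; adjust to your release):

```shell
# Compile against the Nutch jar plus the bundled Hadoop jar.
# Jar names vary by Nutch release -- check your lib/ directory.
javac -classpath nutch-0.9.jar:lib/hadoop-0.12.2-core.jar MySearcher.java

# Run with the same jars, plus conf/ so Nutch can find its configuration:
java -classpath .:nutch-0.9.jar:lib/hadoop-0.12.2-core.jar:conf MySearcher
```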
I am using the same settings I used with Nutch 0.7. Are there any
settings I could tweak to optimize my performance?
On Nov 28, 2007 2:50 PM, Josh Attenberg [EMAIL PROTECTED] wrote:
Previously, I had been using Nutch 0.7. When crawling I achieved very
good rates, ~100 pages per second,
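One place fetch rates are commonly tuned is the fetcher thread settings in conf/nutch-site.xml. A hedged sketch - the property names come from the stock nutch-default.xml, but the values below are purely illustrative, not recommendations:

```xml
<configuration>
  <!-- Total fetcher threads; raise cautiously and watch bandwidth/CPU. -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>
  <!-- Concurrent threads per host; keep low to stay polite to servers. -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>2</value>
  </property>
</configuration>
```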
Hi Paul,
Leaving aside the hardware requirements for the crawl...
The main issue with what you need to achieve is the nature of
your index. If you're searching the results of a standard Nutch web
crawl, then search times under 500ms shouldn't be a problem.
But you actually want something more
Hi folks,
I would deeply appreciate if someone can shed light on how to solve a
specific search I am trying to accomplish with
Nutch.
I am currently ABLE to do the following:
Use Nutch to crawl a directory in the local filesystem (Linux). (The local
directory has HTML files.)
When I run
I have generated lots of indexes for individual sites using Nutch and was
looking for a way to merge all the indexes into one index to be used in a live
system. I really struggled to merge them all, but finally found a way. Here
are the steps.
Let's say you have two working
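The steps alluded to above can be sketched with Nutch's IndexMerger tool (`bin/nutch merge`). A minimal sketch, assuming a Nutch 0.8/0.9 command set and two completed per-site crawls; all directory names here are illustrative:

```shell
# Collect the per-site indexes under one parent directory:
mkdir -p merged/indexes
cp -r crawl-siteA/index merged/indexes/siteA
cp -r crawl-siteB/index merged/indexes/siteB

# Merge them into a single index that the live searcher can point at:
bin/nutch merge merged/index merged/indexes/siteA merged/indexes/siteB
```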
Hi -
I'm looking to hire a programmer to build a custom OPIC scoring filter
plugin (or plugins), for both index time and query time (boost documents
flagged important or not important, from a MySQL table).
I've done a lot of research and it seems to be easy, but I'm not much of a
Java programmer. Or if it's
Thanks very much for the details... I appreciate it...
I'd be happy with the 500ms range on *average* but totally understand
your point about searches piling up
So you're suggesting about 20 million pages per box - each box with 4
drives, dual CPUs, and 4 GB RAM?
I guess what I don't totally
Hello All,
I have a question.
Our scenario is:
1. Crawl the DB initially and build the index to run the
application
2. Add new URLs to the webdb (not a problem)
3. We want to crawl just the new websites (from step 2)
and add the fetched results into the DB and index
created initially - we want to do this
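Steps 2-3 roughly map onto the standard generate/fetch/update cycle followed by indexing the new segment and merging it into the existing index. A hedged sketch against the Nutch 0.8/0.9 command set (the crawl/ directory layout is illustrative; check your version's usage strings):

```shell
# Generate a fetch list covering the newly injected URLs, then fetch it:
bin/nutch generate crawl/crawldb crawl/segments
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment

# Fold the fetch results back into the crawldb and link database:
bin/nutch updatedb crawl/crawldb $segment
bin/nutch invertlinks crawl/linkdb $segment

# Index the new segment, then merge it with the original index:
bin/nutch index crawl/new-indexes crawl/crawldb crawl/linkdb $segment
bin/nutch merge crawl/index-merged crawl/index crawl/new-indexes
```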
Hi,
We want to fetch all sub-folders of a site, excluding outside links and
pages at upper levels.
Is there a way to set up Nutch to work in such a mode, without
using the RegExp filter file (regex-urlfilter.txt)?
Thank you.
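One common alternative to regex-urlfilter.txt is the prefix URL filter plugin (urlfilter-prefix), enabled via plugin.includes in nutch-site.xml in place of urlfilter-regex. It passes only URLs that start with a listed prefix, which matches the "one subtree of a site" case. A sketch, with an illustrative site:

```
# conf/prefix-urlfilter.txt -- only URLs starting with these prefixes pass.
# (http://www.example.com/docs/ is illustrative; list your subtree here.)
http://www.example.com/docs/
```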