Re: multiple sites run

2007-07-05 Thread Insurance Squared Inc.
Your conclusion is incorrect. You can just call the URL with the search term from within a PHP script for example. Mozdex will return XML formatted results. Just take the results and format them within your page. No link or other mention of mozdex is required. It does do exactly what you'r
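
For what it's worth, a minimal PHP sketch of that approach, assuming a Nutch-style opensearch endpoint (same URL pattern as the one discussed further down this list) and RSS-style title/link/description fields - the actual mozdex URL and field names would need to be checked:

<?php
// Fetch XML results from an opensearch-style endpoint (assumed URL) and
// format them with our own markup; no mention of the backend is needed.
$query = urlencode('term life insurance');
$url   = 'http://localhost:8080/opensearch?query=' . $query . '&start=0';

$raw = file_get_contents($url);
$xml = ($raw !== false) ? simplexml_load_string($raw) : false;
if ($xml === false) {
    die('Could not fetch or parse search results');
}

foreach ($xml->channel->item as $item) {
    echo '<p><a href="' . htmlspecialchars($item->link) . '">'
       . htmlspecialchars($item->title) . "</a><br />\n"
       . htmlspecialchars($item->description) . "</p>\n";
}
?>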

Re: integrate Nutch into my php front page

2007-06-30 Thread Insurance Squared Inc.
7) at org.apache.nutch.crawl.Injector.inject(Injector.java:138) at org.apache.nutch.crawl.Crawl.main(Crawl.java:105) ~~~ Daniel Clark, President DAC Systems, Inc. 5209 Nanticoke Court Centreville, VA 20120 Cell - (703) 403-0340 Email - [EMAIL PROTECTED]

nutch books

2007-04-14 Thread Insurance Squared Inc.
I've got some nutch related books I'm looking to clear off my bookshelf. If you're interested, I've posted them on Craigslist here: http://kitchener.craigslist.org/bks/311929785.html

Re: Wikia Search Engine? Anyone working on it?

2007-03-26 Thread Insurance Squared Inc.
Personally I suspect a wolf in sheep's clothing. I see this as another move intended to monetize. Nothing wrong with that, unless everyone believes what they're doing isn't for that purpose. Enis Soztutar wrote: Sean Dean wrote: I've been following it, but haven't posted anything over there.

Re: alternative for dmoz rdf ?

2007-01-13 Thread Insurance Squared Inc.
Actually, I believe this information is available online if you do some digging - but only for some TLDs. I'm pretty sure the .com info, and likely .org and .net, is readily available. I know the .ca data is not available. I don't recall any specific licensing issues the last time I looked. g

Re: search performance

2006-12-29 Thread Insurance Squared Inc.
chael g. Michael Wechner wrote: Insurance Squared Inc. wrote: Make sure you don't have any empty or bad segments. We had some serious speed issues for a long time until we realized we had some empty segments that had been generated as we tested. Nutch would then sit and spin on the

Re: search performance

2006-12-29 Thread Insurance Squared Inc.
If I recall correctly, we just checked the segment directories for space size. The bad ones had files of only 32K or something like that. g. Michael Wechner wrote: Insurance Squared Inc. wrote: Make sure you don't have any empty or bad segments. We had some serious speed issues
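
A rough PHP sketch of that kind of check - walking each segment directory and flagging any that contain files at or below a suspiciously small size (the segments path and the 32K threshold here are assumptions taken from the description above):

<?php
// Flag segment directories containing suspiciously small data files,
// which in our case indicated empty or bad segments worth reviewing.
$segmentsDir = '/home/glenn/nutch/segments';   // assumed path
$threshold   = 32 * 1024;                      // roughly 32K, as noted above

foreach (glob($segmentsDir . '/*', GLOB_ONLYDIR) as $segment) {
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($segment));
    foreach ($files as $file) {
        if ($file->isFile() && $file->getSize() <= $threshold) {
            echo "Suspect segment: $segment ("
               . $file->getPathname() . " is " . $file->getSize() . " bytes)\n";
            break;   // one tiny file is enough to flag the segment
        }
    }
}
?>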

Re: search performance

2006-12-29 Thread Insurance Squared Inc.
Make sure you don't have any empty or bad segments. We had some serious speed issues for a long time until we realized we had some empty segments that had been generated as we tested. Nutch would then sit and spin on these bad segments for a few seconds on every search. Simply deleting the

Re: New Wikipedia search engine using Nutch

2006-12-26 Thread Insurance Squared Inc.
Interesting. There goes the premise that Wiki is not for profit. e w wrote: Haven't seen anyone mention this on the lists yet but is probably of interest to the community: http://www.techcrunch.com/2006/12/23/wikipedia-to-launch-searchengine-exclusive-screenshot/

Re: lucene/nutch investigation

2006-12-05 Thread Insurance Squared Inc.
Hi Bruce, This list is not only very active - it's full of people constantly giving helpful, instructive answers. If you've got questions, this is the place. I would say based on my experience that nutch is a) excellent and b) not for the faint of heart when it comes to java - you'll need s

nutch-xml.conf

2006-08-13 Thread Insurance Squared Inc.
I've lost the thread, but someone here had recently asked for our nutch xml configuration file. Our developer's back from holidays, so I've got the info now. Note that some of the configuration variables are not in the default file, as we've made modifications. On our dual Xeon, 8 gigs of RAM,

Re: Crawling the entire web -- what's involved?

2006-08-09 Thread Insurance Squared Inc.
As a second indicator of the scale, IIRC Doug Cutting posted a while ago that he downloaded and indexed 50 million pages in a day or two with about 10 servers. We download about 100,000 pages per hour on a dedicated 10 Mbps connection. Nutch will definitely fill more than a 10 Mbps connection

Re: Crawling the entire web -- what's involved?

2006-08-09 Thread Insurance Squared Inc.
Well, just very roughly: 4 billion pages x 20K per page / 1,000K per meg / 1,000 megs per gig = 80,000 gigs of data transfer every month. 100 Mbps connection / 8 megabits per megabyte * 60 seconds in a minute * 60 minutes in an hour * 24 hours in a day * 30 days in a month = 32,400 gigs per month. So
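
A quick sanity check of that arithmetic (a throwaway PHP snippet using the same assumptions as above - 20K average page size, 30-day month):

<?php
// Rough monthly data volumes, using the same assumptions as above.
$pages       = 4e9;    // pages to crawl
$kb_per_page = 20;     // assumed average page size in KB
$crawl_gb    = $pages * $kb_per_page / 1000 / 1000;    // ~80,000 GB total

$link_mbps   = 100;                                    // link speed
$mb_per_sec  = $link_mbps / 8;                         // 12.5 MB/s
$monthly_gb  = $mb_per_sec * 60 * 60 * 24 * 30 / 1000; // ~32,400 GB per month

printf("Crawl size: %.0f GB, monthly capacity at 100 Mbps: %.0f GB\n",
       $crawl_gb, $monthly_gb);
?>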

Re: Search with sponsored ads?

2006-08-07 Thread Insurance Squared Inc.
I've been using a paid (but low cost) script by smarterscripts.com to display ads. It's not the ideal solution but works for a lower volume of searches (it's a php/mysql solution). There's an open-source script called phpadsnew that may also work, but IIRC it didn't allow for third party adverti

Re: Add Wyona to the wiki support page?

2006-06-21 Thread Insurance Squared Inc.
Well, so much for knee-jerk suspicions as to intent. No need to look for conspiracy theories when default settings are more likely to be the cause. That should probably be a corollary to Occam's razor or something :). Andrzej Bialecki wrote: Insurance Squared Inc. wrote: The funny

Re: Add Wyona to the wiki support page?

2006-06-21 Thread Insurance Squared Inc.
The funny thing about that wiki page (and some others in that area) is that they apparently use the nofollow tags. Given the topic of that wiki, isn't that a bit odd? I personally dislike the nofollow tag and think it should be used only in extreme circumstances (i.e. here's a link to a site

Ignore! [Fwd: any java/tomcat experts in the crowd?]

2006-05-25 Thread Insurance Squared Inc.
Please ignore my last message. My apologies - I meant to send this off to my local linux users group instead of this one. Original Message Subject:any java/tomcat experts in the crowd? Date: Thu, 25 May 2006 18:00:18 -0400 From: Insurance Squared Inc. <[EM

any java/tomcat experts in the crowd?

2006-05-25 Thread Insurance Squared Inc.
Are there any java/tomcat setup experts in the crowd who'd be able to take a (paid) look at my webserver and help with some setup problems? We've got a java application running on one website, have just installed a second website with a slightly modified version of the same application, and ca

two nutch indexes on same webserver

2006-05-25 Thread Insurance Squared Inc.
Hi All, We've been running nutch on one website on our server, and we've just added a second website that's running nutch on a separate index/crawl/segments. We're experiencing some difficulty getting things running and separating the two. Not entirely sure that the difficulty is with nutch or

Re: changing ranking

2006-05-20 Thread Insurance Squared Inc.
Could I trouble anyone to post a link to the scoring API documentation, as well as the "paper that underlies the current Nutch implementation"? I've dipped into the docs in a few places and haven't bumped into either of these documents. Thanks, g. Ken Krugler wrote: Eugen Kochuev wrote:

Re: Boost for inbound links

2006-05-15 Thread Insurance Squared Inc.
s' i.e. other sites within the database, as part of the ranking. Thanks, g. Insurance Squared Inc. wrote: Is there any good way to boost for the number of inbound links to a page? I guess we can't use PR as that's patented, but I thought that we could somehow boost based on

Boost for inbound links

2006-05-15 Thread Insurance Squared Inc.
Is there any good way to boost for the number of inbound links to a page? I guess we can't use PR as that's patented, but I thought that we could somehow boost based on the number of inbound links. Upon looking at the conf though (and my developer's reply) it doesn't seem like we can do this.

modifying inbound link text calc

2006-05-15 Thread Insurance Squared Inc.
I'm trying to get rid of some spammy sites in our index. First, I wonder if anyone has any suggestions on changes to the default install config of Nutch that will help drive better sites to the top and spammier sites down. Secondly, I boosted the inbound anchor text config - but if anything

Re: Launch nutch from the web-application

2006-05-12 Thread Insurance Squared Inc.
We've got a php front end for version 0.71 that starts and stops crawling/indexing/fetching. One button starts the entire process - updatedb/create fetchlist/fetch/index - over and over. A second button stops the process, unless a crawl is in progress at which point it stops after the current
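
A minimal sketch of how such a start/stop loop can be wired up from PHP - shelling out to the Nutch command line and checking a stop flag between steps (the bin/nutch argument lines below are placeholders, not the exact 0.71 syntax):

<?php
// Loop the generate/fetch/updatedb/index cycle until a stop flag appears.
// The "stop" button simply creates the flag file; the loop finishes the
// command in progress and then exits.
$nutchDir = '/home/glenn/nutch';          // assumed install path
$stopFlag = '/tmp/nutch-stop.flag';

$cycle = array(                           // placeholder command lines
    'bin/nutch generate db segments',
    'bin/nutch fetch segments/latest',
    'bin/nutch updatedb db segments/latest',
    'bin/nutch index segments/latest',
);

while (!file_exists($stopFlag)) {
    foreach ($cycle as $cmd) {
        $output = array();
        $status = 0;
        exec("cd $nutchDir && $cmd 2>&1", $output, $status);
        if ($status !== 0) {
            error_log("Nutch step failed: $cmd\n" . implode("\n", $output));
            break 2;   // bail out of both loops on error
        }
        if (file_exists($stopFlag)) {
            break 2;   // stop requested: current step is done, so exit
        }
    }
}
?>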

Re: Admin Gui beta test (was Re: ATB: Heritrix)

2006-04-28 Thread Insurance Squared Inc.
I'd prefer not to make a long-term commitment, but if a 2 Mbit connection is good enough and this is a short-term thing, I'll step up if no one else can. I could probably make a longer-term commitment in a few weeks. Worst case I can host it at home. glenn gekkokid wrote: what about putt

Re: redirect treatment

2006-04-15 Thread Insurance Squared Inc.
on behind the scenes in the ASP code. It knows the URL and content received. As of right now in 0.8 dev, meta-level redirects (meta refresh tags) don't work correctly. They did in 0.7 but I don't think that functionality has been ported. Dennis Insurance Squared Inc. wrote: How are redire

redirect treatment

2006-04-15 Thread Insurance Squared Inc.
How are redirects listed in version 0.7? If the crawler finds a link like: www.domain.com/?code.aspx&redirect=445454 and that link redirects through to www.another-domain.com, which of those two links will show up in nutch? (I'm wondering if I can use nutch to crawl sites with a lot of redire

Info on scoring/indexing and pagerank

2006-04-05 Thread Insurance Squared Inc.
Hi All, Two general questions: - I'm wondering if there are any good sources of written information on actually writing a search engine script. Things like scoring, indexing, that kind of stuff. I bought the lucene book, but that's lucene specific technical info. Looking for something at t

Re: Legal issues

2006-03-30 Thread Insurance Squared Inc.
ure using another sort of camera called robot :-) Nothing more really. If a browser maker decides to show an HTML tag in, let's say, 300 pixels, will that be a copyright or trademark violation then? What one can do is prevent oneself from being photographed, or stop the robots from visiting one's website :-) On 3/

Re: Legal issues

2006-03-30 Thread Insurance Squared Inc.
FWIW, I believe all of what's been stated is the case - and I'd also assume that since Google/MSN/Yahoo are all doing this, it's been tested and OK. However, I know many people complain about the cache. Some people see it as a copyright violation - technically correct or not, the cache doe

Removing urls from webdb

2006-03-22 Thread Insurance Squared Inc.
We've got a website that is causing our crawler to slow down (from 20 Mbits down to 3-5) - 400K pages that are basically not available; we're just getting 404s. I'd like to remove them from the DB to get our crawl speed back up again. Here's what our developer told me - I'm stumped, that seem

removing site from webdb

2006-03-17 Thread Insurance Squared Inc.
We've got a site that is causing our crawl to slow dramatically, from 20 Mbits down to about 3 or 4. The basic problem is that the site seems to consist of huge numbers of pages that aren't responding. We can remove the site from the index, but it seems like a problem to remove this site perma

Searching only a whitelist (country specific SE)

2006-03-15 Thread Insurance Squared Inc.
Hi All, We're merrily proceeding down our route of a country specific search engine, nutch seems to be working well. However we're finding some sites creeping in that aren't from our country. Specifically, we automatically allow in sites that are hosted within the country. We're finding mo

Search speed - resolution/summary

2006-03-08 Thread Insurance Squared Inc.
Seems we've found the problem that was causing our search delays. We had some indexes that were 32 bytes; apparently they'd crashed somehow (not yet determined how). The existence of these segments was the source of the problem. We removed those segments and the search is running along much

Re: help with creating a directory ie front page menu of common terms

2006-03-08 Thread Insurance Squared Inc.
Just a note that while this idea is good, displaying 'recent searches' can be used by spammers. All they have to do is hammer your server with a bunch of queries for 'www.some-poker-site.com' and their website gets a link from yours. I'd be very leery of republishing any user inputs to your syst
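
If recent searches are shown anyway, a small illustrative PHP precaution along the lines of that warning is to escape the query text and render it as plain text rather than a live link (a sketch only, not tied to any particular front end):

<?php
// Display recent searches defensively: escape everything and never turn
// user-supplied query strings into clickable links.
$recentSearches = array('term life insurance', 'www.some-poker-site.com');

echo "<ul>\n";
foreach ($recentSearches as $search) {
    // htmlspecialchars blocks markup injection; plain text means spammers
    // get no link out of hammering the search box with URLs.
    echo '<li>' . htmlspecialchars($search, ENT_QUOTES) . "</li>\n";
}
echo "</ul>\n";
?>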

Re: Search Speed

2006-03-08 Thread Insurance Squared Inc.
ol or script, what does it say? On 08.03.2006 at 17:38, Insurance Squared Inc. wrote: I appreciate your patience as we try to get over our search speed issues. We're getting closer - it seems we are having huge delays when retrieving the summaries for the various search results. Bel

Search Speed

2006-03-08 Thread Insurance Squared Inc.
I appreciate your patience as we try to get over our search speed issues. We're getting closer - it seems we are having huge delays when retrieving the summaries for the various search results. Below are our logs from a search, you can see that retrieving some of the search summaries took in

Re: Link Farms

2006-03-07 Thread Insurance Squared Inc.
I don't think it is a slam dunk either; even Google doesn't do a super job of detecting these. I think a lot of it's still done manually. I think you'd have to look at detecting closed networks or mostly closed networks (since the link farm would be relatively clustered from a link perspectiv

move from nutch 0.71 to 0.8

2006-03-06 Thread Insurance Squared Inc.
I've seen it noted that a complete recrawl is necessary to migrate from 0.71 to 0.8. Is this absolutely necessary? Or could a converter be created to migrate the data? Has anyone created this? I expect at some point I'll have to move versions and something like this would be very useful. I

Re: Normal search speeds

2006-03-05 Thread Insurance Squared Inc.
something at the OS or tomcat level, or with another system process that nutch is using). Stefan Groschupf wrote: This is very slow! You can expect results in less than a second from my experience. + check memory settings of tomcat. + you do not use ndfs, right? Am 06.03.2006 um 00:23 schr

Normal search speeds

2006-03-05 Thread Insurance Squared Inc.
Asking again for the patience of the list, we're still working on speed. I guess what I need to know is if we still have a 'problem' or if the following search speeds are normal for nutch. query: 'term life insurance'; first search 25 seconds, second search 6 seconds. query: 'stratford bed and

speed concerns, calling nutch from php

2006-02-28 Thread Insurance Squared Inc.
We've built a php frontend onto nutch. We're finding that this interface is dreadfully slow, and the problem is the interface between the two languages. Here's where the slowdown is: $url = 'http://localhost:8080/opensearch?query=' . $query . '&start=' . $start_index .
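
One way to narrow down where that time goes is to time the HTTP round trip and the XML handling separately - a rough sketch, reusing the opensearch URL shown above and assuming the curl extension is available:

<?php
// Time the request to the opensearch servlet separately from the PHP-side
// parsing, to see which side of the interface is actually slow.
$query       = 'term life insurance';
$start_index = 0;
$url = 'http://localhost:8080/opensearch?query=' . urlencode($query)
     . '&start=' . $start_index;

$t0 = microtime(true);
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$raw = curl_exec($ch);
curl_close($ch);
$t1 = microtime(true);

if ($raw === false) {
    die('Request to the opensearch servlet failed');
}

$xml = simplexml_load_string($raw);
$t2 = microtime(true);

printf("HTTP request: %.2fs, XML parse: %.2fs\n", $t1 - $t0, $t2 - $t1);
?>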

Re: out of memory error

2006-02-22 Thread Insurance Squared Inc.
All newer Tomcats require Java 1.5, which I do not yet use for nutch. On 22.02.2006 at 15:59, Insurance Squared Inc. wrote: Thanks for your help Stefan (as always). We've fixed the problem as follows: export CATALINA_OPTS="-Xms512m -Xmx2000m" into /var/jakarta-tomcat-4.1.31/bin

Re: out of memory error

2006-02-22 Thread Insurance Squared Inc.
I personally have never had such problems. How many segments / indexes do you have? On 22.02.2006 at 15:21, Insurance Squared Inc. wrote: We're getting an out of memory error when running a search using nutch 0.71 on a machine with 3 gigs of RAM. Here's the error: at org.apache.tomca

out of memory error

2006-02-22 Thread Insurance Squared Inc.
We're getting an out of memory error when running a search using nutch 0.71 on a machine with 3 gigs of RAM. Here's the error: at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:683) at java.lang.Thread.run(Thread.java:534) - Root Cause - ja

Excessive retries

2006-02-20 Thread Insurance Squared Inc.
Hi, We're finding that we've got one or two domains that are causing excessive retries - and that's drastically slowing our fetch process down by hours. Any general guidance on how to fix the problem? We've upped our max retries variable from 1 to 3, I believe, and are still getting the problem.

Deleting pages/sites from index

2006-02-15 Thread Insurance Squared Inc.
I know this has been asked a number of times, but I don't think there's been a definitive answer posted yet. Is there any way (in v0.71) to immediately remove a site (or all the pages from a site) from the index? Right now with our setup I think we have to wait the 30 days for the segment to

Off-topic:scsi vs sata/speed

2006-02-09 Thread Insurance Squared Inc.
We're finding nutch slightly slow when doing searches, and I'm trying to find the least expensive route to speed things up. I swapped the SCSI drives out of the server because we ran out of space. Instead I threw in a couple of Seagate 300 gig SATA hard drives with software RAID 0 so that I have eno

[Fwd: Re: deleting old segments]

2006-02-08 Thread Insurance Squared Inc.
based upon the URL of the page alone. The information which you gave was also useful, but I want to do the above. Rgds, Prabhu On 2/8/06, Insurance Squared Inc. <[EMAIL PROTECTED]> wrote: Hi Prabhu, Below is the script we use for deleting old

Re: deleting old segments

2006-02-08 Thread Insurance Squared Inc.
Hi Prabhu, Below is the script we use for deleting old segments. Regards, Glenn

#!/bin/sh
# Remove old dirs from segments dir
# PERIOD is threshold for old dirs
#
# Created by Keren Yu Jan 31, 2006
NUTCH_DIR=/home/glenn/nutch
PERIOD=30
# put dirs which are older than PERIOD into dates.tmp
ls

Speeding up initial searches using cache

2006-02-06 Thread Insurance Squared Inc.
Hi, Running nutch 0.71 on Mandrake linux 2006 (P4 with 2 SATA drives on RAID 0, 2 gigs of RAM, about 4 million pages, but expecting to hit 10+), and finding that our initial queries take up to 15-20 seconds to return results. I'd like to get that sped up and am seeking thoughts on how to

crawl/update speed

2006-01-23 Thread Insurance Squared Inc.
Would anyone care to comment on the speed of this please? Seems awfully long to me. 20 threads, a crawl took 25 hours for about 400K URLs. It's now been updating for 20 hours and is not yet complete. System: - nutch 0.7 - P4 2.8, 1 gig of RAM - No problems on the internet connection (I had

throttling bandwidth

2006-01-16 Thread Insurance Squared Inc.
My ISP called and said my nutch crawler is chewing up 20 Mbits on a line where we're only supposed to be using 10. Is there an easy way to tinker with how much bandwidth we're using at once? I know we can change the number of open threads the crawler has, but it seems to me this won't make a huge di

large filter file, time to update db

2006-01-12 Thread Insurance Squared Inc.
Hi, I'm trying to determine if there's a better way to whitelist a large number of domains than just adding them as regular expressions in the filter. We're setting up a regional search engine and using the filter file to determine what URLs make it into the db. We've added specific domai
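
For what it's worth, a small helper along those lines - turning a flat list of whitelisted domains into +regex lines for the filter file - could look like the sketch below. The file names and the exact pattern syntax are illustrative only and should be checked against the urlfilter actually in use:

<?php
// Turn a plain list of whitelisted domains into +regex lines for the
// URL filter file. File names and pattern details are illustrative.
$domains = file('domains.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

$lines = array();
foreach ($domains as $domain) {
    $escaped = preg_quote(trim($domain), '/');
    // Allow any subdomain of the whitelisted domain, and anything under it.
    $lines[] = '+^http://([a-z0-9-]+\.)*' . $escaped . '/';
}
$lines[] = '-.';   // reject everything else last

file_put_contents('whitelist-urlfilter.txt', implode("\n", $lines) . "\n");
?>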

Nutch freezing - deflateBytes

2006-01-09 Thread Insurance Squared Inc.
Our nutch installation (version .7, running on Mandrake linux) continues to freeze sporadically during fetching. Our developer has it pinned down to the deflateBytes library. "It looped in the native method called deflateBytes for a very long time. Sometimes it took several hours." That's a

upgrade to version 0.8

2006-01-04 Thread Insurance Squared Inc.
We're just wiping down a server to install nutch in more of a production environment. Does anyone have any thoughts on whether we should upgrade to version 0.8, from version .7? Our developer suggested that the only reason we would need to do that would be if we needed distributed computing;

Re: Nutch freezing on fetch

2005-12-30 Thread Insurance Squared Inc.
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:351) at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488) "VM Thread" prio=1 tid=0x08090638 nid=0x1442 runnable "VM Periodic Task Thread" prio=1 tid=0x08099cd8 nid=0x1442 waiting on condition "Suspend C

Re: Nutch freezing on fetch

2005-12-30 Thread Insurance Squared Inc.
Andrzej Bialecki wrote: Insurance Squared Inc. wrote: Hi All, We're experiencing problems with nutch freezing sporadically when fetching. Not really too sure where to even start investigating. Some digging into the archives suggested memory issues, so we did the foll

Nutch freezing on fetch

2005-12-30 Thread Insurance Squared Inc.
Hi All, We're experiencing problems with nutch freezing sporadically when fetching. Not really too sure where to even start investigating. Some digging into the archives suggested memory issues, so we did the following: TOMCAT_OPTS=" -Xmx1024M" to increase Tomcat memory and NUTCH_HEAPSIZE=1

Re: Crawling search engines and cgi scripts

2005-12-16 Thread Insurance Squared Inc.
lter. Any general thoughts on how we might start to tackle this? Thanks. Insurance Squared Inc. wrote: We're running a crawl using nutch and the last crawl seemed to be taking a long time. Looking at the output, it seems it's gone into AOL's search and is actually crawling

Crawling search engines and cgi scripts

2005-12-16 Thread Insurance Squared Inc.
We're running a crawl using nutch and the last crawl seemed to be taking a long time. Looking at the output, it seems it's gone into AOL's search and is actually crawling search results (it's also crawling some cgi-bin search results page on another site). This sure seems like it could go on

User Agent

2005-12-09 Thread Insurance Squared Inc.
What should I be using for a user agent in the crawler? We just tried crawling a government site and if we leave the user agent set to nutch, we get the crawl. When I change it, I'm getting blocked with an error about the user agent not being supported. It seems that I should be changing the

Cache page, modifying the output

2005-12-09 Thread Insurance Squared Inc.
We're putting a php front end on nutch, and have it working for the search results page by grabbing an XML feed from nutch. However, the cache link is still calling nutch directly, and my developer indicated that the cache page doesn't seem to be available via XML. What's the best way for us to ta
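
One workable approach, since the cache isn't exposed as XML, is to have the PHP side fetch the cached-page HTML from the Nutch webapp and re-wrap it in your own template. A sketch only - the cached.jsp URL and its idx/id parameters below are assumptions, as are the header.php/footer.php includes, so check them against your installed webapp:

<?php
// Proxy the Nutch cached-page output through PHP so the cache link can
// point at our own site. URL and parameter names below are assumptions.
$idx = isset($_GET['idx']) ? (int) $_GET['idx'] : 0;
$id  = isset($_GET['id'])  ? (int) $_GET['id']  : 0;

$cacheUrl = 'http://localhost:8080/cached.jsp?idx=' . $idx . '&id=' . $id;
$html     = file_get_contents($cacheUrl);
if ($html === false) {
    die('Cached copy is not available');
}

// Wrap the cached HTML in our own header/footer (hypothetical includes).
include 'header.php';
echo $html;
include 'footer.php';
?>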

Re: Setting up a crawler for a country.

2005-12-07 Thread Insurance Squared Inc.
ow that verisign makes this available for .com and .net as "TLD zone files". for ccTLDs like .us and .uk, you'll have to see if the TLD registrar provides the same. the following page has some useful links to these folks: http://www.dnsstuff.com/info/dnslinks.htm --matt O

Re: ad feed for nutch

2005-12-07 Thread Insurance Squared Inc.
t that will also require ad serving, but want it to be open source and give greater transparency to the advertisers than they get today with google and overture. If you start developing one, were you thinking of making this an open source project? Thanks.

ad feed for nutch

2005-12-06 Thread Insurance Squared Inc.
Has anyone had any luck with advertising/ad management systems being integrated into nutch? Not just something for the owner to admin ads, but to allow external advertisers to manage their accounts/bids, that kind of thing. I'm drawing up plans for one if none are available, but clearly somet

Crawling TLD's + injected sites.

2005-12-06 Thread Insurance Squared Inc.
We're trying to index based on a country. What I'm trying to accomplish is: - auto crawl sites with the correct TLD - auto crawl manually injected sites. From this, I then want to follow only sites that match the TLD. This means that sites with the correct TLD extension, if found anyw

Configuration Docs

2005-11-30 Thread Insurance Squared Inc.
Where's the best source for documentation on the configuration of nutch? Not finding much/anything, my developer is wasting his days poring over the code :). Thanks.

Re: Setting up a crawler for a country.

2005-11-29 Thread Insurance Squared Inc.
Along these same lines (as I'm interested in a similar country-specific project), is there any place to get a list of all the domains for a specific TLD to use to seed nutch? i.e. if I wanted to get a list of all currently registered .it, .de, or .ca domains? I've looked without success. I'm thin

Looking for nutch consultant

2005-11-28 Thread Insurance Squared Inc.
I'm setting up a nutch-based SE in a small niche and need some assistance setting up the program - a quick walk-through on the system. Specifically, I'd like to cover setup, configuration, how to set up a polite crawler, how often and deep to crawl, run

Using nutch for niche/country specific TLD

2005-11-07 Thread Insurance Squared Inc.
Hi, We're barely past the install stages with nutch, so I'd like to ask the more experienced a few general questions before I jump in with both feet. I'm thinking about creating a country-specific (by TLD) search engine. - Can nutch crawl only specific TLDs (e.g. .it or .uk.com)? My