Hi Mike, Nutch general folks

>An obvious and easy
>step is for a cluster system to maintain its own local Nutch
>index, and then postprocess them.

Yes, that's what I want to do with SearchTuna, where an appropriately
focused index has been created to answer typed queries (e.g. science,
popular culture or medical).

>However, Nutch has been designed to generate a
>traditional-style hitlist, not clusters. So I don't know
>whether it's possible to use our index for really deep
>integration.

Maybe it could; that's a design issue. I'm not sure Nutch will ever
achieve "lift off" if it's just like Google and requires the same
resources and money to exist.

>If you'd like to take a look at some Nutch-based
>clustering work, possibly in conjunction with someone else from
>the nutch-dev list, of course we'd welcome it.

I would. How do I get that process started?

> BTW, my TunaSearch for "Red Sox" just finished. I can't
> believe it found "Game 6" as a related concept! Sigh.

If you liked that, you'll love the query I ran for "Boston Red Sox curse"
the very day that Grady left Pedro in the game too long.

http://www.searchtuna.com/cgi-bin/st.cgi?action=main:getSearchPage&searchId=680

Cheers,

Dave Cohen

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Michael Cafarella
Sent: Sunday, February 01, 2004 3:49 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: [Nutch-general] Re: Assumptions About Internet Search

Hi Dave,

Sorry for the long delay in mailing.

Clustering is definitely something a lot of people have expressed
interest in. You might want to check out the "nutch-dev" mailing list,
where there's been a lot of discussion of the Carrot^2 project. I don't
know an enormous amount about it myself.

As far as I know, most clustering systems scrape remote search engines
for results, or do some runtime crawling like your SearchTuna seems to.
An obvious and easy step is for a cluster system to maintain its own
local Nutch index, and then postprocess them. However, Nutch has been
designed to generate a traditional-style hitlist, not clusters.
So I don't know whether it's possible to use our index for really deep
integration.

The main priority for Nutch is generating a great hit-list, but we also
want it to be the best platform for experimentation with search.

If you'd like to take a look at some Nutch-based clustering work,
possibly in conjunction with someone else from the nutch-dev list, of
course we'd welcome it.

BTW, my TunaSearch for "Red Sox" just finished. I can't believe it found
"Game 6" as a related concept! Sigh.

--Mike

On Sat, 2003-12-20 at 13:53, Dave Cohen wrote:
> Hello all,
>
> First off I think Nutch is a much-needed project because it
> will deliver unbiased results or make the bias public - that is
> the context of my remarks below.
>
> Looking through your FAQ, I read the following:
>
> >We don't think it is presently possible to build a peer-to-peer
> >search engine that is competitive with existing search engines.
> >It would just be too slow. Returning results in less than a second
> >is important: it lets people rapidly reformulate their queries so
> >that they can more often find what they're looking for. In short,
> >a fast search engine is a better search engine. I don't think many
> >people would want to use a search engine that takes ten or more
> >seconds to return results.
>
> As far as I know, these remarks contain many untested assumptions.
> Generally speaking, I don't think people know how to formulate queries.
> Getting fast results for poor non-context-sensitive queries over and
> over again is not very helpful to the user. Some search engines and
> metacrawlers attempt to remedy this with various techniques (e.g.
> related search clustering at Vivisimo or hub page results at Teoma).
> Google's results are not very helpful to the user at all. They have
> no incentive to make them so and also of course they are focused
> entirely on text-based ads.
>
> In addition, there is mention of "10 to 20" search tuning parameters
> given the Nutch index.
> I believe that in 5 years or 10 years, without
> some paradigm shift in search technology, there will still be these
> same 10 or 20 parameters and an index with rapid query response. There
> has been little innovation in search technology since Google introduced
> PageRank link analysis about 5 years ago. And there will be little in
> the future unless some new direction is taken.
>
> I wrote a demo site called SearchTuna (so hard to find names these days)
> at http://www.searchtuna.com/. Working without benefit of an index, I
> search the web for the user and e-mail them the results. It's just one
> Linux machine connected to a shared T1, so results take a while. But a lot
> of useful, unbiased information is presented back to the user, including
> a definition, image, keywords for query refinement, so-called content
> pages (authorities, hopefully) and hub pages. Importantly, with an
> appropriately constructed index and some big processing power, I believe
> all this information could be delivered to the user in well under a minute.
> A lot of useful information can be created, given a short, coherent query.
>
> RE: indexing the web, I believe a lot of semantically useful metainfo can
> be created and stored as the indexing goes on. Query results can then consult
> this "metaindex". Also, if you have a specific type of (context for a) query
> (e.g. medical), then a good search engine should look at the appropriate
> parts of the index for a query of that type. I know Google is interested
> in this kind of thing, given their recent technology acquisitions. Finally,
> I am working on a query reformulator (noun-phrase parsing/analysis) that I
> believe can be helpful to ordinary searchers who don't know how to find what
> they want.
>
> I'd like to be involved somehow in Nutch but don't know if my work is
> of any interest to you. Please look over some SearchTuna results if you
> would - recent search examples are available on the homepage.
> Any comments about all of the above are certainly welcome. I know I'm
> swimming against the current here.
>
> Sincerely,
>
> Dave Cohen
> [EMAIL PROTECTED] (or [EMAIL PROTECTED])
> Boulder, Colorado
>
> -------------------------------------------------------
> This SF.net email is sponsored by: IBM Linux Tutorials.
> Become an expert in LINUX or just sharpen your skills. Sign up for IBM's
> Free Linux Tutorials. Learn everything from the bash shell to sys admin.
> Click now! http://ads.osdn.com/?ad_id=1278&alloc_id=3371&op=click
> _______________________________________________
> Nutch-general mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-general

-------------------------------------------------------
The SF.Net email is sponsored by EclipseCon 2004
Premiere Conference on Open Tools Development and Integration
See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
http://www.eclipsecon.org/osdn
_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general
