RE: [Nutch-general] Re: Assumptions About Internet Search

Dave Cohen Tue, 03 Feb 2004 13:20:41 -0800

Hi Mike, Nutch general folks

>An obvious and easy 
>step is for a cluster system to maintain its own local Nutch 
>index, and then postprocess them.


Yes, that's what I want to do with SearchTuna, where an 
appropriately focused index has been created to answer 
typed queries (e.g. science, popular culture or medical).

>    However, Nutch has been designed to generate a 
>traditional-style hitlist, not clusters.  So I don't know 
>whether it's possible to use our index for really deep
>integration.  

Maybe it could, that's a design issue. I'm not sure Nutch will ever
achieve "lift off" if it's just like Google and requires the same
resources and money to exist.

>If you'd like to take a look at some Nutch-based 
>clustering work, possibly in conjunction with someone else from 
>the nutch-dev list, of course we'd welcome it.  

I would, how do I get that process started?

>  BTW, my TunaSearch for "Red Sox" just finished.  I can't
> believe it found "Game 6" as a related concept!  Sigh.

If you liked that, you'll love the query I ran for "Boston Red Sox curse"
the very day that Grady left Pedro in the game too long.
http://www.searchtuna.com/cgi-bin/st.cgi?action=main:getSearchPage&searchId=680

Cheers,

Dave Cohen

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Michael Cafarella
Sent: Sunday, February 01, 2004 3:49 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: [Nutch-general] Re: Assumptions About Internet Search



  Hi Dave,

  Sorry for the long delay in mailing.

  Clustering is definitely something a lot of people have
expressed interest in.  You might want to check out the
"nutch-dev" mailing list, where there's been a lot of
discussion of the Carrot^2 project.  I don't know an
enormous amount about it myself.

    As far as I know, most clustering systems scrape 
remote search engines for results, or do some runtime
crawling like your SearchTuna seems to.  An obvious and easy 
step is for a cluster system to maintain its own local Nutch 
index, and then postprocess them.

    However, Nutch has been designed to generate a 
traditional-style hitlist, not clusters.  So I don't know 
whether it's possible to use our index for really deep
integration.  
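
  To make the "local index, then postprocess" idea concrete, here is
a minimal sketch of grouping a flat hitlist into labeled clusters
after the fact. It is an illustration only: it does not use any Nutch
or Carrot^2 API, the function names (terms, cluster_hits) and the
stopword list are hypothetical, and the grouping rule (label each hit
by its most widely shared non-query term) is a deliberately naive
stand-in for a real clustering algorithm.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not from any library).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "on", "is", "at"}


def terms(text):
    """Return the set of lowercase alphabetic tokens, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def cluster_hits(hits, query):
    """Group hit snippets under the most widely shared non-query term.

    `hits` is a list of title/snippet strings as a hitlist might return
    them; `query` is the original query string. Hits that share no term
    with any other hit are collected under the label "other".
    """
    query_terms = terms(query)
    hit_terms = [terms(h) - query_terms for h in hits]

    # Count in how many hits each remaining term appears.
    df = Counter(t for ts in hit_terms for t in ts)

    clusters = defaultdict(list)
    for hit, ts in zip(hits, hit_terms):
        shared = [t for t in ts if df[t] > 1]
        # Most frequent shared term wins; ties are broken alphabetically.
        label = min(shared, key=lambda t: (-df[t], t)) if shared else "other"
        clusters[label].append(hit)
    return dict(clusters)
```

  Calling cluster_hits(snippets, "Red Sox") on a list of result
snippets returns a dict mapping each label to the snippets filed under
it. Because this works only on the returned hitlist, it is the shallow
kind of integration; anything deeper would need access to the index
itself.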

  The main priority for Nutch is generating a great hit-list, 
but we also want it to be the best platform for experimentation
with search.  If you'd like to take a look at some Nutch-based 
clustering work, possibly in conjunction with someone else from 
the nutch-dev list, of course we'd welcome it.  

  BTW, my TunaSearch for "Red Sox" just finished.  I can't
believe it found "Game 6" as a related concept!  Sigh.

  --Mike



On Sat, 2003-12-20 at 13:53, Dave Cohen wrote:
> Hello all,
> 
> First off I think Nutch is a much-needed project because it
> will deliver unbiased results or make the bias public - that is
> the context of my remarks below.
> 
> Looking through your FAQ, I read the following:
> 
> >We don't think it is presently possible to build a peer-to-peer 
> >search engine that is competitive with existing search engines. 
> >It would just be too slow. Returning results in less than a second 
> >is important: it lets people rapidly reformulate their queries so 
> >that they can more often find what they're looking for. In short, 
> >a fast search engine is a better search engine. I don't think many 
> >people would want to use a search engine that takes ten or more 
> >seconds to return results.
> 
> As far as I know, these remarks contain many untested assumptions.
> Generally speaking, I don't think people know how to formulate queries.
> Getting fast results for poor non-context-sensitive queries over and 
> over again is not very helpful to the user. Some search engines and 
> metacrawlers attempt to remedy this with various techniques (e.g. 
> related search clustering at Vivisimo or hub page results at Teoma). 
> Google's results are not very helpful to the user at all. They have 
> no incentive to make them so and also of course they are focused 
> entirely on text-based ads.
> 
> In addition, there is mention of "10 to 20" search tuning parameters
> given the Nutch index. I believe that in 5 years or 10 years, without
> some paradigm shift in search technology, there will still be these
> same 10 or 20 parameters and an index with rapid query response. There 
> has been little innovation in search technology since Google introduced 
> PageRank link analysis about 5 years ago. And there will be little in 
> the future unless some new direction is taken.
> 
> I wrote a demo site called SearchTuna (so hard to find names these days)
> at http://www.searchtuna.com/. Working without benefit of an index, I
> search the web for the user and e-mail them the results. It's just one
> Linux machine connected to a shared T1, so results take a while. But, a lot
> of useful, unbiased information is presented back to the user, including
> a definition, image, keywords for query refinement, so-called content
> pages (authorities, hopefully) and hub pages. Importantly, with an
> appropriately constructed index and some big processing power, I believe 
> all this information could be delivered to the user in well under a minute.
> A lot of useful information can be created, given a short, coherent query.
> 
> RE: indexing the web, I believe a lot of semantically useful metainfo can
> be created and stored as the indexing goes on. Query results can then consult
> this "metaindex". Also, if you have a specific type of (context for a) query
> (e.g. medical), then a good search engine should look at the appropriate 
> parts of the index for a query of that type. I know Google is interested
> in this kind of thing, given their recent technology acquisitions. Finally, 
> I am working on a query-reformulator (Noun phrase parsing/analysis) that I 
> believe can be helpful to ordinary searchers who don't know how to find what 
> they want.
> 
> I'd like to be involved somehow in Nutch but don't know if my work is
> of any interest to you. Please look over some SearchTuna results if you
> would - recent search examples are available on the homepage. Any comments
> about all of the above are certainly welcome. I know I'm swimming against
> the current here.
> 
> Sincerely,
> 
> Dave Cohen
> [EMAIL PROTECTED] (or [EMAIL PROTECTED])
> Boulder, Colorado
> 
> -------------------------------------------------------
> This SF.net email is sponsored by: IBM Linux Tutorials.
> Become an expert in LINUX or just sharpen your skills.  Sign up for IBM's
> Free Linux Tutorials.  Learn everything from the bash shell to sys admin.
> Click now! http://ads.osdn.com/?ad_id=1278&alloc_id=3371&op=click
> _______________________________________________
> Nutch-general mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/nutch-general




-------------------------------------------------------
The SF.Net email is sponsored by EclipseCon 2004
Premiere Conference on Open Tools Development and Integration
See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
http://www.eclipsecon.org/osdn
_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general

