Re: reoot site query results

2004-12-06 Thread Doug Cutting
In web search, link information helps greatly.  (This was Google's big 
discovery.)  There are lots more links that point to 
http://www.slashdot.org/ than to http://www.slashdot.org/xxx/yyy, and 
many (if not most) of these links have the term "slashdot", while links 
to http://www.slashdot.org/xxx/yyy are somewhat less likely to contain 
the term "slashdot".

As Erik hinted, Nutch uses this information.  It keeps a database of 
links that point to each page, indexes their anchor text along with the 
page, and boosts highly linked pages more than less-linked pages.
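
To make the idea concrete, here is a minimal sketch in plain Java of a link database that collects anchor text per target page and derives a boost from the inlink count. It is not Nutch's code: the class name LinkDb, its methods, and the logarithmic damping formula are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative link database (hypothetical): target URL -> anchor texts of inbound links. */
class LinkDb {
    private final Map<String, List<String>> anchorsByTarget = new HashMap<>();

    /** Record one inbound link and the anchor text it used. */
    void addLink(String targetUrl, String anchorText) {
        anchorsByTarget.computeIfAbsent(targetUrl, k -> new ArrayList<>()).add(anchorText);
    }

    /** Number of recorded links pointing at the page. */
    int inlinkCount(String targetUrl) {
        return anchorsByTarget.getOrDefault(targetUrl, Collections.emptyList()).size();
    }

    /** Anchor text that would be indexed alongside the page's own content. */
    String anchorText(String targetUrl) {
        return String.join(" ", anchorsByTarget.getOrDefault(targetUrl, Collections.emptyList()));
    }

    /** Assumed damping: the boost grows with the log of the inlink count; no inlinks gives 1.0. */
    float linkBoost(String targetUrl) {
        return (float) (1.0 + Math.log(1.0 + inlinkCount(targetUrl)));
    }

    public static void main(String[] args) {
        LinkDb db = new LinkDb();
        db.addLink("http://www.slashdot.org/", "slashdot");
        db.addLink("http://www.slashdot.org/", "slashdot home page");
        db.addLink("http://www.slashdot.org/xxx/yyy", "an interesting story");

        System.out.println(db.anchorText("http://www.slashdot.org/"));
        System.out.println(db.linkBoost("http://www.slashdot.org/"));
        System.out.println(db.linkBoost("http://www.slashdot.org/xxx/yyy"));
    }
}
```

With data like this, the root page both matches "slashdot" through its anchor text and receives the larger boost, without any URL-based guess about what a root is.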

Doug
Chris Fraschetti wrote:
My Lucene implementation works great; it's basically an index of many
web crawls. The main thing my users complain about is that, say, a
search for "slashdot" will return
http://www.slashdot.org/soem_dir/somepage.asp as the top result
because the factors I score on rank it that way. In true search-engine
fashion, though, I would like
http://www.slashdot.org/ to be the very top result. I've added a
boost to queries that match the hostname field, which helped a little
but is obviously not a proper solution. Does anyone out there in the
search engine world have a good scheme for determining root websites
and applying a huge boost to them in one fashion or another, mainly so
that the root appears before any sub-pages (assuming the query is in
reference to that site)?
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: reoot site query results

2004-12-06 Thread Erik Hatcher
Perhaps look at Nutch to see whether (and if so, how) it deals with 
this situation.

Determining the root seems to be a pretty tricky endeavor.  Each of 
these could be a root:

http://www.ehatchersolutions.com/JavaDevWithAnt
http://www.example.com/~username
And certainly lots of other combinations like you've described.
Erik
On Dec 6, 2004, at 5:54 AM, Chris Fraschetti wrote:
I do this to some extent. Currently I apply a boost if, as best I can
tell, it's a root page. But I am asking more about how to determine root
pages. Content obviously isn't easy to use. The URL is the main
key, but that can be tricky as well. Basically the pages are from
a crawl, so their URLs show how they were originally linked to, e.g.
http://www.microsoft.com may have been visited via an outgoing link of
another page as http://www.microsoft.com/index.asp?title=true or some
variant like that. The page is still the root, but the URL now contains
a page name. Beyond that, I can simply check the hostname of the URL
using Java's URL class, as well as the path that the URL class gives
me. But how much of a boost would be appropriate? Too much of a
boost might make a root page rank higher than a non-root page that
is more relevant.


Re: reoot site query results

2004-12-06 Thread Chris Fraschetti
I do this to some extent. Currently I apply a boost if, as best I can
tell, it's a root page. But I am asking more about how to determine root
pages. Content obviously isn't easy to use. The URL is the main
key, but that can be tricky as well. Basically the pages are from
a crawl, so their URLs show how they were originally linked to, e.g.
http://www.microsoft.com may have been visited via an outgoing link of
another page as http://www.microsoft.com/index.asp?title=true or some
variant like that. The page is still the root, but the URL now contains
a page name. Beyond that, I can simply check the hostname of the URL
using Java's URL class, as well as the path that the URL class gives
me. But how much of a boost would be appropriate? Too much of a
boost might make a root page rank higher than a non-root page that
is more relevant.
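
A rough sketch of the host-and-path check described above might look like the following. It uses java.net.URI (which exposes the same getHost() and getPath() accessors as the URL class) so that malformed input is easy to reject. The class name RootPageHeuristic and the list of default-document names (index, default, home) are assumptions made for illustration, not an established rule.

```java
import java.net.URI;
import java.util.Locale;

/** Heuristic root-page test (hypothetical); the default-document pattern is an assumption. */
class RootPageHeuristic {

    /** True when the path is empty, "/", or a single default document such as /index.asp. */
    static boolean isLikelyRoot(String url) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return false; // unparseable URL: treat as non-root
        }
        if (uri.getHost() == null) {
            return false; // no hostname, so not a web page URL we can judge
        }
        String path = uri.getPath() == null ? "" : uri.getPath().toLowerCase(Locale.ROOT);
        if (path.isEmpty() || path.equals("/")) {
            return true;
        }
        // Assumed default-document names; the query string is ignored by getPath().
        return path.matches("/(index|default|home)\\.[a-z0-9]+");
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.microsoft.com",
            "http://www.microsoft.com/index.asp?title=true",
            "http://www.slashdot.org/soem_dir/somepage.asp",
            "http://www.example.com/~username"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + isLikelyRoot(url));
        }
    }
}
```

Note the limitation: a site that lives under a path, such as http://www.example.com/~username, is reported as non-root, so a URL-only check can serve as one signal at best.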




-- 
___
Chris Fraschetti, Student CompSci System Admin
University of San Francisco
e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu




Re: reoot site query results

2004-12-06 Thread Erik Hatcher
Consider applying the boost to the Document, rather than the field, at 
index time.  I assume each document in your index represents one page.  
At indexing time you know whether it is a root page or not, right?
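
As a hedged illustration of what a document-level boost does, the plain-Java sketch below picks a boost once per page and applies it as a multiplier to a raw score. The class name IndexTimeBoost, the 1.5 value, and the simple multiplication are assumptions; Lucene's real scoring folds the boost into its norms rather than multiplying a final score like this. In the Lucene releases of that era, the chosen value would be handed to Document.setBoost(float) before the document is added to the index.

```java
/** Sketch (hypothetical): choose a per-document boost once, at indexing time. */
class IndexTimeBoost {

    /** Assumed value; a modest multiplier keeps clearly more relevant sub-pages on top. */
    static final float ROOT_BOOST = 1.5f;

    /** Boost to store with the whole document (one page), not with a single field. */
    static float documentBoost(boolean isRootPage) {
        return isRootPage ? ROOT_BOOST : 1.0f;
    }

    /** Simplified model of how a document boost scales a raw query score. */
    static float boostedScore(float rawScore, boolean isRootPage) {
        return rawScore * documentBoost(isRootPage);
    }

    public static void main(String[] args) {
        // Near tie: the root page overtakes a slightly better-matching sub-page.
        System.out.println(boostedScore(0.8f, true) > boostedScore(0.9f, false));
        // Large gap: the much more relevant sub-page still ranks first.
        System.out.println(boostedScore(0.4f, true) < boostedScore(0.9f, false));
    }
}
```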

Erik


reoot site query results

2004-12-06 Thread Chris Fraschetti
My Lucene implementation works great; it's basically an index of many
web crawls. The main thing my users complain about is that, say, a
search for "slashdot" will return
http://www.slashdot.org/soem_dir/somepage.asp as the top result
because the factors I score on rank it that way. In true search-engine
fashion, though, I would like
http://www.slashdot.org/ to be the very top result. I've added a
boost to queries that match the hostname field, which helped a little
but is obviously not a proper solution. Does anyone out there in the
search engine world have a good scheme for determining root websites
and applying a huge boost to them in one fashion or another, mainly so
that the root appears before any sub-pages (assuming the query is in
reference to that site)?

-- 
___
Chris Fraschetti
e [EMAIL PROTECTED]
