In web search, link information helps greatly. (This was Google's big
discovery.) Far more links point to
http://www.slashdot.org/ than to http://www.slashdot.org/xxx/yyy, and
many (if not most) of these links contain the term "slashdot", while links
to http://www.slashdot.org/xx
Perhaps look at Nutch to see whether (and if so, how) it deals with
this situation.
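
To make the link idea concrete, here is a minimal sketch that counts, per
target URL, the inbound links whose anchor text contains the query term. It
is not Google's or Nutch's actual method; the (source, target, anchor) tuple
layout and the function name anchor_scores are assumptions made up for
illustration.

```python
from collections import defaultdict


def anchor_scores(links, term):
    """Count inbound links per target URL whose anchor text contains `term`.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples.
    This tuple layout is a hypothetical one chosen for the sketch.
    """
    term = term.lower()
    scores = defaultdict(int)
    for _source, target, anchor in links:
        # Whole-word, case-insensitive match on the anchor text.
        if term in anchor.lower().split():
            scores[target] += 1
    return dict(scores)


# Made-up link data, for illustration only.
links = [
    ("http://a.example/", "http://www.slashdot.org/", "Slashdot"),
    ("http://b.example/", "http://www.slashdot.org/", "read slashdot daily"),
    ("http://c.example/", "http://www.slashdot.org/xxx/yyy", "this story"),
]
scores = anchor_scores(links, "slashdot")
```

With counts like these stored at index time, the root page would collect a
much larger anchor-text score for "slashdot" than any deep page, and that
score could be turned into a boost.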
Determining the root seems to be a pretty tricky endeavor. Each of
these could be a root:
http://www.ehatchersolutions.com/JavaDevWithAnt
http://www.example.com/~username
And certainly lots of o
I do this to some extent: currently I apply a boost if, as best I
can tell, it's a root page. But I am really asking how to determine root
pages. Content obviously isn't easy to use; the URL is the main
key, but that can be tricky as well. Basically the pages are from
a crawl, so their URLs
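
A rough sketch of a URL-only heuristic along these lines, assuming that an
empty path, or a single path segment with no file extension, suggests a
root. The names looks_like_root and root_boost and the default boost of 2.0
are hypothetical, and the rule is deliberately loose: it accepts the
JavaDevWithAnt and ~username examples above, but it would also accept
something like /about.

```python
from urllib.parse import urlsplit


def looks_like_root(url):
    """Guess whether `url` is a site or section root from its shape alone."""
    parts = urlsplit(url)
    if parts.query:
        # URLs with a query string are treated as non-roots (an assumption).
        return False
    path = parts.path.strip("/")
    if path == "":
        return True  # e.g. http://www.slashdot.org/
    segments = path.split("/")
    # One segment without a file extension, e.g. /~username
    return len(segments) == 1 and "." not in segments[0]


def root_boost(url, boost=2.0):
    """Return `boost` for probable roots and 1.0 (no change) otherwise."""
    return boost if looks_like_root(url) else 1.0
```

The returned value could be applied as a per-document boost at index time.
Since URL shape alone misclassifies some pages, it is probably best used
together with link counts rather than on its own.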
On Dec 6, 2004, at 4:53 AM, Chris Fraschetti wrote:
My Lucene implementation works great; it's basically an index of many
web crawls. The main thing my users complain about is that, say, a search
for "slashdot" will return
http://www.slashdot.org/soem_dir/somepage.asp as the top result
because the factors I have scoring it determine it as so... but
obvi