Re: reoot site query results

2004-12-06 Thread Doug Cutting
In web search, link information helps greatly. (This was Google's big discovery.) There are lots more links that point to http://www.slashdot.org/ than to http://www.slashdot.org/xxx/yyy, and many (if not most) of these links have the term "slashdot", while links to http://www.slashdot.org/xx
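A rough sketch of the link-counting idea, not how Nutch or any search engine actually does it: given the links seen during a crawl, count for each target URL how many inbound links carry the query term in their anchor text. The `AnchorTextCounter` and `Link` names are hypothetical, and so is the assumption that the crawl stores each link as a target URL plus its anchor text.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: counts inbound links whose anchor text mentions a term.
final class AnchorTextCounter {
    private AnchorTextCounter() {}

    // Hypothetical record of one crawled link (assumed to come from the crawl data).
    static final class Link {
        final String targetUrl;
        final String anchorText;

        Link(String targetUrl, String anchorText) {
            this.targetUrl = targetUrl;
            this.anchorText = anchorText;
        }
    }

    // Returns, per target URL, the number of links whose anchor text
    // contains the term (case-insensitive substring match).
    static Map<String, Integer> countMatchingInlinks(List<Link> links, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (Link link : links) {
            if (link.anchorText.toLowerCase(Locale.ROOT).contains(needle)) {
                counts.merge(link.targetUrl, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

A page with a higher count for the query term could then be given more weight at index or query time; how that weight is applied is left open here.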

Re: reoot site query results

2004-12-06 Thread Erik Hatcher
Perhaps look at Nutch to see whether (and if so, how) it deals with this situation. Determining the root seems to be a pretty tricky endeavor. Each of these could be a root:
http://www.ehatchersolutions.com/JavaDevWithAnt
http://www.example.com/~username
And certainly lots of o

Re: reoot site query results

2004-12-06 Thread Chris Fraschetti
I do this to some extent... currently I apply a boost if it is, as best I can tell, a root page. But I am more asking how to determine root pages... content obviously isn't easy to use... the URL is the main key... but that can be tricky as well... Basically the pages are from a crawl... so their urls
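One possible URL-only heuristic, as a minimal sketch and not the poster's actual method: treat a page as a likely root when its URL has no query string and its path is empty, "/", or a default document such as /index.html directly under the host. The class name, method names, and the list of default-document names are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: guesses whether a crawled URL is a site's root page.
final class RootPageHeuristic {
    private RootPageHeuristic() {}

    // True when the URL has a host, no query string, and an empty path, "/",
    // or an index/default document directly under the host (assumed names).
    static boolean isLikelyRoot(String url) {
        final URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false; // unparseable URLs are never treated as roots
        }
        if (uri.getHost() == null || uri.getRawQuery() != null) {
            return false;
        }
        String path = uri.getRawPath();
        if (path == null || path.isEmpty() || path.equals("/")) {
            return true;
        }
        return path.matches("/(index|default)\\.[A-Za-z0-9]+");
    }

    // Returns rootBoost for likely roots and a neutral 1.0 otherwise.
    static float boostFor(String url, float rootBoost) {
        return isLikelyRoot(url) ? rootBoost : 1.0f;
    }
}
```

With this sketch, http://www.slashdot.org/ counts as a root and http://www.slashdot.org/soem_dir/somepage.asp does not. It also rejects http://www.example.com/~username and http://www.ehatchersolutions.com/JavaDevWithAnt, which is exactly the kind of case that makes a URL-only rule tricky.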

Re: reoot site query results

2004-12-06 Thread Erik Hatcher
On Dec 6, 2004, at 4:53 AM, Chris Fraschetti wrote: My Lucene implementation works great; it's basically an index of many web crawls. The main thing my users complain about is that, say, a search for "slashdot" will return http://www.slashdot.org/soem_dir/somepage.asp as the top result because the fact

reoot site query results

2004-12-06 Thread Chris Fraschetti
My Lucene implementation works great; it's basically an index of many web crawls. The main thing my users complain about is that, say, a search for "slashdot" will return http://www.slashdot.org/soem_dir/somepage.asp as the top result because the factors I have scoring it determine it as so... but obvi