> > Looking through the logs, there seemed to be valuable information buried away.
>
> Yes ... it's our 'mine'. :-)

Well put.

> I had thought about this a little differently, but unfortunately (again!) I had no
> time to make those ideas real. I was thinking of treating as effective found URLs
> those which are followed by another click in the same session more than, let's say,
> 10 seconds later (that's what the browsing session is for!). Of course some heuristic
> assumption has to be made for the last entry of a session.
>
> By doing this you could get rid of some of the false results you were talking about.

This is a little more ambitious than what I had in mind. Initially I think we could
make assumptions about "noise": the "click and see" responses. With a time-based decay
function on the weighting, they would settle out, and true hits would emerge from the
noise. In our organisation there are definitely "hot topics" that come and go. If
people are just following the sequence of links in the results list, chances are that
the pages they found were quite relevant to their search pattern anyway, and belonged
near the top of the list.
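Just to make the combination of the two ideas concrete, here is a rough sketch (not
ht://Dig code). It assumes a hypothetical click log with one
"<session-id> <unix-time> <url>" line per result click, ordered by time within each
session; a click counts as effective when the next click in the same session comes at
least 10 seconds later, and each effective hit is weighted with an exponential decay
so stale hot topics fade. The two-week half-life is just a guess.

// Rough sketch only, not ht://Dig code.
// Input : "<session-id> <unix-time> <url>" lines, time-ordered per session.
// Output: a decayed "effective hit" weight per URL.
#include <cmath>
#include <cstddef>
#include <ctime>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Click { std::time_t when; std::string url; };

int main()
{
    const double dwell_threshold = 10.0;           // seconds before the next click
    const double half_life       = 14.0 * 86400.0; // two-week half-life (an assumption)
    const std::time_t now        = std::time(0);

    // Group the clicks by browsing session.
    std::map<std::string, std::vector<Click> > sessions;
    std::string session, url;
    long when;
    while (std::cin >> session >> when >> url)
    {
        Click c;
        c.when = (std::time_t) when;
        c.url  = url;
        sessions[session].push_back(c);
    }

    // A click is "effective" if the next click in the same session comes at
    // least dwell_threshold seconds later; the last click of a session is
    // taken as effective too (the heuristic assumption for the last entry).
    std::map<std::string, double> weight;
    std::map<std::string, std::vector<Click> >::const_iterator s;
    for (s = sessions.begin(); s != sessions.end(); ++s)
    {
        const std::vector<Click> &clicks = s->second;
        for (std::size_t i = 0; i < clicks.size(); ++i)
        {
            bool effective = (i + 1 == clicks.size())
                || std::difftime(clicks[i + 1].when, clicks[i].when) >= dwell_threshold;
            if (!effective)
                continue;
            // Exponential decay: recent hits count more, old "hot topics" fade.
            double age = std::difftime(now, clicks[i].when);
            weight[clicks[i].url] += std::pow(0.5, age / half_life);
        }
    }

    std::map<std::string, double>::const_iterator w;
    for (w = weight.begin(); w != weight.end(); ++w)
        std::cout << w->second << '\t' << w->first << '\n';
    return 0;
}

Something along those lines could run over the followed-links log once a night, in a
single pass, without touching the existing databases at all.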
> Using usage information internally is what ht://Dig lacks, as we already use both
> content and structure in some way (through the 'backlink_factor' attribute, for
> instance). IMHO this could be very hard to implement, at least without re-thinking
> the actual design of the whole system.

I don't know the internal workings of htdig, but there are already weighting factors
for description, backlinks, etc. I realise that these are global weightings. I suspect
I am talking about a "recency" factor, but the problem is that it needs to be
associated with each URL, and this might require continuously updating the document
database (or an auxiliary database). On the other hand, I would be happy to just
re-merge every night (to incorporate the followed-links log).
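To sketch what I mean by an auxiliary database (again, not ht://Dig's actual scoring
code): the nightly job would regenerate a "<weight> <url>" file like the one above,
and a per-URL usage boost could then be applied on top of the engine's own score, in
the same spirit as backlink_factor but keyed to usage. The file name, the 0.2 factor
and the log-shaped boost below are only placeholders.

// Rough sketch only, not ht://Dig's scoring code.  Loads an auxiliary
// "<weight> <url>" file (e.g. the output of the pass above, regenerated each
// night) and nudges a document's score by a per-URL usage factor.
#include <cmath>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

class UsageBoost
{
public:
    // Load the nightly-merged click weights from the auxiliary file.
    explicit UsageBoost(const std::string &path)
    {
        std::ifstream in(path.c_str());
        double      w;
        std::string url;
        while (in >> w >> url)
            weights_[url] = w;
    }

    // Multiply the engine's own score by 1 + factor * log(1 + weight), so
    // unclicked URLs are left alone and heavily used ones rise gently.
    double Adjust(const std::string &url, double base_score, double factor = 0.2) const
    {
        std::map<std::string, double>::const_iterator it = weights_.find(url);
        if (it == weights_.end())
            return base_score;
        return base_score * (1.0 + factor * std::log(1.0 + it->second));
    }

private:
    std::map<std::string, double> weights_;
};

int main()
{
    UsageBoost boost("click_weights.txt");  // hypothetical auxiliary file name
    std::cout << boost.Adjust("http://example.org/page.html", 42.0) << '\n';
    return 0;
}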
> I guess people like Geoff and Neal could be more precise, but an internal usage
> module should be something more than a URL/frequency archive, since, for instance,
> not all the clicked URLs are what the user effectively wanted; the learning mechanism
> would have to be designed from scratch for ht://Dig, and consistency between
> different crawls and incremental updating would have to be preserved. Also, I am
> afraid that such a module would affect the actual design.

I think the decay function would effectively filter out irrelevant URLs.

> However, this is something I'd really love to see implemented in ht://Dig, but not
> in this phase, as I think we have to be realistic: our next aim is to release the
> first stable release of the 3.2 branch. That must be our first step.

OK. I don't want to start a large re-engineering project, but perhaps an incremental
approach might let us play with some of these ideas.

Regards,
Kev.

---
Kevin Shepherd, a/Regional Computing Manager
Bureau of Meteorology, Hobart
Ph 03 6221 2103
