> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Andrew Moise
> Sent: Thursday, May 13, 2004 12:46 PM
> To: [EMAIL PROTECTED]
> Cc: Jim; [EMAIL PROTECTED]
> Subject: Re: [htdig-dev] 3.2.0 - is it worth it??
>
> On Thu, 2004-05-13 at 08:58, Lachlan Andrew wrote:
> > My impression is that the ht://Dig project is basically dead :( The
> > existing code is of course still functional (and thanks Jim and
> > Gilles for all the support you give to the users!), but I don't think
> > there is enough enthusiasm to either release a new version, either
> > 3.2 or 3.3. If I get enthusiastic in the next couple of weeks, I
> > might still try to put 3.2.0b6 together, but that is about as far as
> > it will go...
>
> As a user, I'm very sorry to hear this -- I just deployed 3.2.0b5 on a
> site I administrate, and I've been very pleased with it. I've been
> waiting until the 3.2.0 release cycle was over to start trying to
> contribute some of my tweaks (I also need to talk to my employer about
> the legalities first), but I guess if the project is stagnating I should
> speak up (there's also the possibility that the imminent death of htdig
> just makes this extra silly, of course... *shrug*). In any case:
>
> So htdig does a bad job when multiple documents match a search in
> similar ways; this shows up particularly when your search query matches
> part of the header or footer of a section of your site, or when your
> search results include threads from a mailing list archive (in which
> case messages within a thread often show up consecutively in the
> results, which adds a lot of noise). I wrote some code (shoehorned in
> as a ScoreMatch, more for easy control by the 'sort' parameter than for
> any logical reason) which sorts the results once, then reduces the score
> of any match which is similar to matches that are higher in the list,
> then resorts the results; thus the high-ranked results that are returned
> tend to be more unique than otherwise. This is marginally helpful with
> the header/footer problem (though the excerpts are still usually
> identical in that case), and very helpful with the mailing-list-thread
> problem. AFAICT it doesn't do too much harm to the results in the normal
> case.
>
> We also found it beneficial to tweak results' scores by matching their
> URLs against a handmade list of URL pieces and score-hacking factors
> (mailing list archives are mediocre, IRC archives are usually unhelpful,
> a particular section of documentation is generally very useful) -- I
> know this is gross, but it did wonders for the effectiveness of our
> search results, and a coworker of mine convinced me that it's not
> totally against nature -- humans really do have special knowledge of
> which sections of a site are generally "good," and with an hour or so of
> tweaking we got things in a state where close results from a "bad"
> section are presented above loose results from a "good" section when
> appropriate (more or less).
>
> It seems to me that it would be useful to generalize these little
> hacks into a search parameter listing which hacks should be applied; for
> example, to select the two score hacks described in the above paragraphs
> you could specify 'result_hacks=unique,urlmatch' in the search query or
> htdig.conf. htdig already has a couple of result hacks that could fit
> into this scheme (backlink_factor and date_factor), and I can think of
> one more at least that I'd like to add in my copious free time. It
> certainly would seem right to me to be able (a) to add stuff like the
> above tweaks to the codebase without forcing everyone to care about it,
> and (b) to test, tweak, and reorder the scoring hacks from a query
> parameter while trying to get things configured to work well.
>
> As I said, I've got (wrongly-integrated) code for the two tweaks I
> mentioned, which I can try to get into a presentable state, and I might
> be able to find time semi-soon to do the work for the general
> result_hacks parameter, if there are people that think either of those
> would be worthwhile. Are there such people?
>
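For the sake of discussion, here is a rough sketch of how the two score hacks and the proposed result_hacks parameter might fit together. It is not Andrew's code and not anything that exists in htdig. The Match class, the word-overlap similarity measure, the threshold and penalty values, and the URL factor table are all made up for illustration, and it is written in Python rather than htdig's C++ only to keep it short.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical stand-in for one search result."""
    url: str
    excerpt: str
    score: float


def similarity(a: str, b: str) -> float:
    """Word-set (Jaccard) overlap of two excerpts; an assumed measure."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def hack_unique(matches, threshold=0.6, penalty=0.5):
    """Sort once, scale down matches similar to a higher-ranked one, re-sort."""
    ranked = sorted(matches, key=lambda m: m.score, reverse=True)
    for i, m in enumerate(ranked):
        # Compare only against results that ranked higher in the first sort.
        if any(similarity(m.excerpt, h.excerpt) >= threshold for h in ranked[:i]):
            m.score *= penalty
    return sorted(ranked, key=lambda m: m.score, reverse=True)


# Handmade (URL piece, factor) pairs; these values are invented examples.
URL_FACTORS = [
    ("/pipermail/", 0.7),    # mailing list archives: mediocre
    ("/irc/", 0.4),          # IRC archives: usually unhelpful
    ("/docs/guide/", 1.5),   # a documentation section that is usually useful
]


def hack_urlmatch(matches, factors=URL_FACTORS):
    """Multiply each score by the factor of every URL piece found in its URL."""
    for m in matches:
        for piece, factor in factors:
            if piece in m.url:
                m.score *= factor
    return sorted(matches, key=lambda m: m.score, reverse=True)


# Registry of hacks selectable by name, e.g. "unique,urlmatch".
HACKS = {"unique": hack_unique, "urlmatch": hack_urlmatch}


def apply_result_hacks(matches, spec):
    """Apply the named hacks in the order they are listed in spec."""
    for name in filter(None, (s.strip() for s in spec.split(","))):
        matches = HACKS[name](matches)
    return matches
```

Calling apply_result_hacks(results, "unique,urlmatch") runs the two hacks in that order, and swapping the names swaps the order, which is the kind of query-time reordering described in point (b) above. Existing factors such as backlink_factor and date_factor could in principle be registered in the same table, though how they would actually hook into htsearch is not something this sketch tries to answer.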
I added a hack to 3.1.6 to allow humans to decide what URL should get top
billing for particular terms
(http://search.aarp.org/cgi-bin/htsearch?config=htdig_www_aarp_org&restrict=&exclude=research.aarp.org&words=ageline).
I've been eagerly awaiting a 3.2 release to try to shoehorn that into the
new database structure. And our users have been clamoring for phrase
searches.

I would love to see 3.2 released, but I understand if time, work and school
pressures make that impossible.

-David

_______________________________________________
ht://Dig Developer mailing list: [EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev
