Re: [htdig-dev] Incremental Index Efficiency, Unicode, & 2GIG limit

Gilles Detillieux Wed, 09 Jan 2002 13:16:21 -0800

According to Neal Richter:
>       Is the thinking that these changes (queriable index fields) would
> take place in the mifluz indexing code?  Is mifluz the core index code in
> 3.2?


Yes and no.  mifluz is the core WORD indexing code in 3.2, but the
additional fields would have to go into db.docdb, which is separate.
The words from these fields would also go into the mifluz word database,
but I don't think that would require any changes at all to mifluz,
as long as it's open-ended about the flag values that are tied to
words in the index.  The changes would be more at the Retriever and
document parser level, as well as the query parser, but not in the guts
of the word database code.  The extra fields in db.docdb would only be
needed if you need the ability to extract those fields in search results,
the way you can now with meta descriptions.  That wouldn't strictly be
needed if all you want to do is limit searches to specific fields.
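
To make the flag idea concrete, here is a minimal sketch of limiting a
search to specific fields by testing a per-word flag bitmask.  It is an
illustration only: the flag names and values, the WordEntry structure and
the searchField() function are all hypothetical, not the actual mifluz or
ht://Dig data structures or API.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical flag bits recording where a word occurred in a document.
// Names and values are illustrative, not ht://Dig's real flag values.
enum FieldFlag : std::uint32_t {
    FIELD_BODY   = 1u << 0,
    FIELD_TITLE  = 1u << 1,
    FIELD_AUTHOR = 1u << 2,  // example of a newly added, queriable field
};

// One entry in a toy word index: the word, the document it came from,
// and a bitmask of the fields in which it appeared.
struct WordEntry {
    std::string word;
    int docId;
    std::uint32_t flags;
};

// Return the IDs of documents in which `word` occurs in at least one of
// the fields selected by `fieldMask`, in index order.
std::vector<int> searchField(const std::vector<WordEntry>& index,
                             const std::string& word,
                             std::uint32_t fieldMask) {
    std::vector<int> hits;
    for (const WordEntry& e : index) {
        if (e.word == word && (e.flags & fieldMask) != 0) {
            hits.push_back(e.docId);
        }
    }
    return hits;
}
```

In this sketch, a new field costs only a new flag bit set by the document
parser and a mask chosen by the query parser.  The index itself just
stores and compares the bits, which is why the word database code would
not need to change as long as it accepts arbitrary flag values.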

> > The release will happen when it's ready. The two main "chunks" that
> > absolutely, positively need to be finished are a sync with the current
> > mifluz code and the new htsearch framework. Beyond that, many things are
> > up in the air--if people come along to do things like Unicode and they can
> > be tested thoroughly, great. If not, it moves back.
> 
>       Just to be clear.. do you mean the release is moved back or that
> the feature is postponed until the next release?  I'm guessing the latter.

Yes, I believe Geoff meant the latter, although we'd certainly consider
holding off a release if some much-requested change is right on the way
(this is usually done by a developers' vote during a pre-release code
freeze, which we're not at yet in 3.2 development).

>       I'm trying to get a good idea of what feature goals the htdig
> developers have for 3.2 [other than what was just given].
> 
>       Not to be pushy, but how complete does everyone consider 3.2 to be
> currently?

Well, how complete it is depends on what features we want in 3.2, as
well as how solid it is.  3.2.0b4 is getting to be reasonably solid,
but there are some major changes coming in before it's released, so
I'd expect more debugging time still before it's ready even for beta
release.  Having said that, there are sites using 3.2 in a production
environment already.

>       My main immediate task is to choose to use either 3.1.5 or
> 3.2Beta as a starting point.  I would obviously prefer to use 3.2Beta and
> submit bug-fixes, feature improvements, & memory leak/corruption fixes
> than do that effort on a previous version.. as well as test 3.2Beta on
> very large data-sets..

For what you're proposing, 3.2 is the only viable option.  3.1.x is
strictly in maintenance mode, so we're not making sweeping new changes
to it.  Anything to do with different search fields or word categories
wouldn't fit into the word database scheme used in 3.1.x.  I'd recommend
you use the latest 3.2.0b4 snapshot as your starting point, or better
yet the htdig-3-2-x branch of the htdig CVS tree on SourceForge.

>       A ball-park ETA is fine..  I'm looking at an early spring
> release for the next version of the software that will use this as an
> archiving tool.  [See www.rightnow.com for more info on our software]
> Note that this (our software) release does not require the archive to be
> UTF-8 ready... just moving in that direction.
> 
>       With QA/Testing taking a big chunk of that time, I want to be
> confident that the state of the 3.2Beta code as of mid-february is good
> enough to send to QA... and of course get all the fixes from our QA back
> to htdig ASAP ;-) !
> 
>       I'm gathering implicitly that the changes envisioned before the 
> 3.2 release are improvements.. ie the Beta code is fully functional now.  
> Correct?

Parts of it are fully functional now, but some parts are still in early
development (e.g. Cookie support).  Whether it's solid for you depends
on what parts you use and what parts you don't.

>       I'd be happy to assist in syncing with mifluz.

Great, that's probably the most urgent change to 3.2 that's needed to
improve its reliability.  I wouldn't say that it's impossible to get
3.2 reasonably solid by mid-February, but it is an awfully tight
deadline!

>       The word-breaking problem for Asian languages like Japanese is a
> hard one.  There are no 'spaces' in most Japanese text, so a language
> specific word-breaking algorithm is needed before any sort of searchable
> index can be built.

Yes, both this problem and the 8-bit limitation must be dealt with
before htdig can deal with most Asian languages.
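
Both problems are easy to see in a small sketch.  The two helper
functions below are hypothetical and written only for illustration; they
are not taken from the htdig source.  The first splits text on
whitespace, which is not a usable word breaker for Japanese.  The second
counts UTF-8 code points, to show that one character is no longer one
8-bit byte.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Naive word breaking: split on whitespace only.
std::vector<std::string> splitOnWhitespace(const std::string& text) {
    std::vector<std::string> words;
    std::istringstream in(text);
    std::string w;
    while (in >> w) {
        words.push_back(w);
    }
    return words;
}

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For example, the three-character string "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"
(the UTF-8 encoding of the Japanese word for "Japanese language") comes
back from splitOnWhitespace() as a single token, and it occupies 9 bytes
for its 3 code points, so any code that treats each byte as one character
will mishandle it.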

>       We've contracted with a large internationalization-software 
> company to assist us in our conversion to be UTF-8.  We accomplish word
> breaking for Japanese documents with software that they supply under
> contract via NDA.  I'll ask our in-house Japanese developer about free
> tools for word breaking.

I'd caution that any htdig development should be done in "clean-room"
conditions, i.e. not by anyone who's signed the NDA and seen
the proprietary code, to avoid any licensing pitfalls.  (See
http://www.linuxjournal.com/article.php?sid=5496 for a cautionary tale.)

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
