Geoff Hutchison wrote:
> 
> >Wonderful IDEA Geoff. I think it's a good solution. Using bit masks we
> >could improve our disk requirements and indexing and searching could not
> >get too much lower ... But, if I am not wrong, by this way, the user could
> >set other 8 fields. For me, the only problem is: are they enough?
> 
> That's my question. That's why I said either 16 bits or 32 bits. I would
> assume that 24 flags should be enough? I was putting this up for comment,
> not writing it in stone. *I* couldn't come up with 8 more tags, but that's
> me.

I can...  :-)
Since most of the documents that are going to be indexed are HTML, we might
as well use some of the information we can glean from the structure tags.  The
title and keywords are pretty obvious, but why stop there?  There are table
headings, the different header levels, blockquotes, definition lists,
(un)ordered lists, code, hyperlinked text, etc.
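
To make the counting concrete, here is a minimal sketch of per-word structure
flags stored as a bit mask.  The flag names and bit assignments are
hypothetical, chosen only to mirror the tags listed above; they are not the
layout ht://Dig uses.

```python
from enum import IntFlag


class WordFlag(IntFlag):
    """Hypothetical per-word structure flags, one bit each."""
    TITLE = 1 << 0
    KEYWORDS = 1 << 1
    HEADING = 1 << 2
    TABLE_HEADING = 1 << 3
    BLOCKQUOTE = 1 << 4
    DEFINITION_LIST = 1 << 5
    LIST_ITEM = 1 << 6
    CODE = 1 << 7
    LINK_TEXT = 1 << 8


def fits_in(bits: int) -> bool:
    """Return True if every defined flag fits in a field of `bits` bits."""
    return all(flag.value < (1 << bits) for flag in WordFlag)


# A word inside a hyperlink inside the title carries both bits.
flags = WordFlag.TITLE | WordFlag.LINK_TEXT
```

With just the nine tags above, an 8-bit field is already too small, while 16
bits still leaves room to spare.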

In addition to that, there is the potential of digging deeper into MS Word
docs, so you'd need to include common word processing styles as well.

Now, structure search will be most useful in documents that actually have a
well-defined structure *OR* documents that can be interpreted as having
structure.  By the latter I mean the kind of thing portals generally use to
find useful information on other web pages.  (DailyUpdate comes to mind...)

This implies (this is going *way* off topic here...) that internally you need
a generalized word source interface.  Each document type *and* structure can
have its own word-source which will convert whatever the document is into a
stream of words with the right flags set.  All of this would be really easy to
do if you had:  a)  threads   b)  dynamic code loading, possibly with website
specific/defined word-source code that gets downloaded on demand.  This is why
I like Java for a search engine...  It can do all of that and still be
portable, secure, and fast :-)
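
A rough sketch of what such a word-source interface could look like, written
in Python only for brevity.  Everything here is an assumption for
illustration: the flag values, the `TAG_FLAGS` table, the `WORD_SOURCES`
registry, and the function names are all made up, and threads and on-demand
code loading are left out.

```python
import re
from html.parser import HTMLParser

# Hypothetical flag bits; a real index would define many more.
TITLE, HEADING, LINK_TEXT = 1, 2, 4
TAG_FLAGS = {"title": TITLE, "h1": HEADING, "h2": HEADING,
             "h3": HEADING, "a": LINK_TEXT}


class HtmlWordSource(HTMLParser):
    """Turns HTML into a list of (word, flags) pairs."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # flagged tags currently open
        self.words = []       # collected (word, flags) pairs

    def handle_starttag(self, tag, attrs):
        if tag in TAG_FLAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Drop the most recently opened occurrence of this tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        flags = 0
        for tag in self.open_tags:
            flags |= TAG_FLAGS[tag]
        for word in re.findall(r"\w+", data.lower()):
            self.words.append((word, flags))


def html_words(text):
    source = HtmlWordSource()
    source.feed(text)
    source.close()
    return source.words


def plain_text_words(text):
    # Plain text has no structure, so no flags are set.
    return [(word, 0) for word in re.findall(r"\w+", text.lower())]


# One word source per document type; the indexer only ever sees the pairs.
WORD_SOURCES = {"text/html": html_words, "text/plain": plain_text_words}


def words_for(content_type, text):
    return WORD_SOURCES[content_type](text)
```

The point of the registry is that the indexer never needs to know what kind
of document it is reading; supporting a new format or a site-specific layout
would mean adding one more entry.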

> >And what do you mean by word position, exactly? Is it the position in chars
> >terms, HTML tags excluded?
> 
> I meant word order, tags excluded. So in this e-mail, "Wonderful" would
> have position 1 and "IDEA" would have position 2.

Hmmm...  This made me think of something I'd never thought about before... 
One of the things that seems to be hard to deal with is defining exactly what a
word is.   Everyone is most likely very much aware of my first attempt at
this: valid_punctuation.  :-)  Well, I think a much better method would be to
add multiple word permutations to the database.  For example something like
"D'Amore" (last name of one of my coworkers) could be entered into the
database as "d'amore" and "amore".  The problem there is the word location. 
My first thought was to give them both the same location number, but they
really aren't the same word, so a phrase search (which would presumably need
to be done on the *exact* words, not permutations) could possibly give
incorrect results.  Maybe a better example would be something like
"word-source" which would be entered into the database as "word", "source",
and "word-source".  What are the locations for those words, then?
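
One possible answer, sketched below, is to give every permutation the
position of the token it came from and mark it as derived, so that a phrase
search can restrict itself to exact entries.  This is only one option, not a
settled design.  The tokenizer regex, the `min_part_len` cutoff (which is what
drops the lone "d" from "d'amore"), and the tuple layout are assumptions.

```python
import re

# A token is a run of letters/digits, optionally joined by ' or -.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def index_entries(text, min_part_len=2):
    """Return (word, position, is_exact) tuples for `text`.

    Positions count tokens from 1.  Parts split out of a compound token
    share its position and are marked is_exact=False.
    """
    entries = []
    for position, match in enumerate(TOKEN.finditer(text), start=1):
        exact = match.group().lower()
        entries.append((exact, position, True))
        parts = re.split(r"['-]", exact)
        if len(parts) > 1:
            for part in parts:
                if len(part) >= min_part_len:
                    entries.append((part, position, False))
    return entries
```

For the text "D'Amore wrote word-source", this yields "d'amore" and "amore"
at position 1, "wrote" at position 2, and "word-source", "word", and "source"
all at position 3, with only the first entry of each group marked exact.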

More random thoughts:
Should phrase searching look for punctuation that is supplied in the query? 
For example, would the phrase search "scherpbier, andrew" be rewritten to
"scherpbier andrew"?
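
If the answer is yes, the rewrite amounts to running the query through the
same tokenizer as the documents.  A minimal sketch, assuming a simple regex
tokenizer and a document held as a flat list of words (both are assumptions
for illustration):

```python
import re


def normalize_phrase(query):
    """Lowercase the query and drop punctuation between words."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", query.lower())


def phrase_positions(doc_words, phrase_words):
    """Return the 1-based positions where the phrase starts in the document."""
    n = len(phrase_words)
    if n == 0:
        return []
    return [i + 1
            for i in range(len(doc_words) - n + 1)
            if doc_words[i:i + n] == phrase_words]
```

Under this scheme "scherpbier, andrew" and "scherpbier andrew" become the same
query, so the comma can neither help nor hurt a match.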
-- 
Andrew Scherpbier <[EMAIL PROTECTED]>
Contigo Software <http://www.contigo.com/>