On Fri, 02 Jun 2006 00:33:23 +0100, Matthew Toseland wrote:

> Firstly, why do we need two index formats? I'm the first to admit that the
> current Librarian index format is limited - way too limited - but why do
> we need two? The main changes I would make to the librarian format right
> now would be:
> - Support splitting. (This is relevant to file indexes)
> - Include word indexes to allow for adjacent word searches. (This is
>   relevant to file indexes too, because you may want to search for
>   adjacent words in a title).
> - Maybe include some amount of metadata - functional (mime type), or
>   theoretical (category, dublin core...), or other (activelinks?). (This
>   is definitely relevant to file indexes).
> - Include the filename in the index. Possibly using negative word
>   indexes to indicate "in the filename" words; it must be possible to
>   distinguish between matches in the page title and matches in the
>   content. (This is also relevant to both web page indexes and file
>   indexes, though especially to the latter).
> 
> I am quite happy to change the format. Indeed it needs significant
> changes.
> 
> Indexes, like all files, are automatically compressed, so don't worry too
> much about it being overly verbose.
> 
> Now, you are proposing additional fields: firstly, the size of the content
> (this isn't especially relevant to web page indexes), and the length of
> the file if it is audio or video. Both are perfectly reasonable extensions
> IMHO. If we are going to support metadata we should support a range of
> metadata; we will need support for a category, (probably tied to a
> specific site), at least, and this is a very woolly and arbitrary thing.
> 
> An explicit aim of your index format is to be able to index the contents
> of text-based files by words. This is a good thing, but if you are going
> to do that, then you should define a format, (preferably with some of the
> details of splitting indexes worked out), and make Librarian and Spider
> use it. Metadata can be shown next to matches, or it can be used to narrow
> down searches.
> 
> And I honestly don't care whether it is XML. I see no reason to take
> strenuous efforts to keep back compatibility, but filters can be written
> easily enough if need be.

I recall Frost having a nasty bug: its parser did not handle error cases
correctly, so it crashed whenever it encountered a message that was
malformed in a particular way. Using XML lets us rely on existing XML
libraries for parsing instead of writing a new parser, which makes it
much less likely that such unpleasantness occurs again. This is especially
important for non-Java programs, since a parser bug there can easily cause
far more serious symptoms than simply crashing.
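
To illustrate the parsing point, here is a minimal sketch (in Python purely
for brevity; the <index> element and the parse_index name are made up for
this example and are not part of any agreed format). A stock XML library
reports malformed input through one well-defined error, which the caller
can catch and move on:

```python
import xml.etree.ElementTree as ET

def parse_index(text):
    """Return the root element, or None if the input is not well-formed XML.

    parse_index and the <index> element are hypothetical names used only
    for illustration.
    """
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        # The library rejects malformed input with a single exception type,
        # so the caller can skip the bad document instead of crashing.
        return None
```

A caller would then treat a None result as "skip this index" rather than
aborting the whole run.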

It also allows for trivial backwards-compatible extension: simply state
that a program should ignore all tags and attributes it doesn't
understand, and you can extend the format as needed while the old programs
will still keep on working.
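
A rough sketch of that "ignore what you don't understand" rule, again in
Python and again with invented names (the <file> element, the key and title
attributes, and read_files are assumptions for illustration only): an old
reader picks out the tags and attributes it knows and silently skips the
rest, so a newer index with extra fields still loads.

```python
import xml.etree.ElementTree as ET

# Hypothetical attributes that an "old" reader understands.
KNOWN_ATTRS = {"key", "title"}

def read_files(text):
    """Collect the known attributes of each <file> child of the root.

    Unknown child tags and unknown attributes are skipped, not rejected.
    """
    root = ET.fromstring(text)
    files = []
    for child in root:
        if child.tag != "file":
            continue  # unknown tag: ignore it
        # Keep only the attributes this reader knows about.
        files.append({k: v for k, v in child.attrib.items() if k in KNOWN_ATTRS})
    return files
```

For example, a document that adds a size attribute and a <category> element
would still yield the same key/title entries for this reader as before.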


