[freenet-dev] Re: Index format proposal

Jusa Saari Fri, 02 Jun 2006 12:04:32 +0300

On Fri, 02 Jun 2006 00:33:23 +0100, Matthew Toseland wrote:

> Firstly, why do we need two index formats? I'm the first to admit that the
> current Librarian index format is limited - way too limited - but why do
> we need two? The main changes I would make to the librarian format right
> now would be:
> - Support splitting. (This is relevant to file indexes)
> - Include word indexes to allow for adjacent word searches. (This is
>   relevant to file indexes too, because you may want to search for
>   adjacent words in a title).
> - Maybe include some amount of metadata - functional (mime type), or
>   theoretical (category, dublin core...), or other (activelinks?). (This
>   is definitely relevant to file indexes).
> - Include the filename in the index. Possibly using negative word
>   indexes to indicate "in the filename" words; it must be possible to
>   distinguish between matches in the page title and matches in the
>   content. (This is also relevant to both web page indexes and file
>   indexes, though especially to the latter).
> 
> I am quite happy to change the format. Indeed it needs significant
> changes.
> 
> Indexes, like all files, are automatically compressed, so don't worry too
> much about it being overly verbose.
> 
> Now, you are proposing additional fields: firstly, the size of the content
> (this isn't especially relevant to web page indexes), and the length of
> the file if it is audio or video. Both are perfectly reasonable extensions
> IMHO. If we are going to support metadata we should support a range of
> metadata; we will need support for a category, (probably tied to a
> specific site), at least, and this is a very woolly and arbitrary thing.
> 
> An explicit aim of your index format is to be able to index the contents
> of text-based files by words. This is a good thing, but if you are going
> to do that, then you should define a format, (preferably with some of the
> details of splitting indexes worked out), and make Librarian and Spider
> use it. Metadata can be shown next to matches, or it can be used to narrow
> down searches.
> 
> And I honestly don't care whether it is XML. I see no reason to take
> strenuous efforts to keep back compatibility, but filters can be written
> easily enough if need be.


I recall Frost having a nasty bug that caused it to crash whenever it
encountered a message that was malformed in a particular way, because its
parser did not handle error cases correctly. Using XML lets us use existing
XML libraries for parsing instead of writing a new parser, which makes it
much less likely that such unpleasantness occurs again. This is especially
important for non-Java programs, since a parser bug there can easily have
far more serious symptoms than a simple crash.
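
As a rough illustration of the point about existing libraries, here is a
minimal Python sketch using the standard xml.etree.ElementTree module. The
<index> and <file> element names and the parse_index helper are made up for
this example and are not part of any proposed format. The library reports
malformed input as an ordinary exception, which the caller can handle in one
place:

```python
import xml.etree.ElementTree as ET


def parse_index(text):
    """Return the root element, or None if the document is not well-formed.

    parse_index is a hypothetical helper for this sketch only.
    """
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        # The library detected the malformed input; reject the whole index.
        return None


# Hypothetical element and attribute names, for illustration only.
good = parse_index('<index><file name="a.txt"/></index>')
bad = parse_index('<index><file name="a.txt"></index>')  # <file> never closed
```

Here good is the parsed root element and bad is None, so a broken index is
rejected rather than half-processed.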

It also allows for trivial backwards-compatible extension: simply state
that a program should ignore all tags and attributes it doesn't
understand, and you can extend the format as needed while the old programs
will still keep on working.
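
The "ignore what you don't understand" rule might look something like the
following Python sketch, again using xml.etree.ElementTree. The element and
attribute names (index, file, name, size, length, comment) and the read_files
helper are assumptions invented for the example, not the format under
discussion:

```python
import xml.etree.ElementTree as ET

# Attributes this (hypothetical) old reader knows about.
KNOWN_ATTRS = ("name", "size")


def read_files(text):
    """Collect <file> entries, skipping unknown tags and attributes."""
    root = ET.fromstring(text)
    entries = []
    for child in root:
        if child.tag != "file":
            continue  # unknown tag: ignore it instead of failing
        entry = {}
        for key in KNOWN_ATTRS:
            value = child.get(key)
            if value is not None:
                entry[key] = value
        # Attributes outside KNOWN_ATTRS are never looked at.
        entries.append(entry)
    return entries


# A "newer" document with an extra attribute and an extra tag.
doc = """<index>
  <file name="a.txt" size="120" length="3:45"/>
  <comment>added in a later version</comment>
  <file name="b.ogg"/>
</index>"""

entries = read_files(doc)
```

The old reader still gets both file entries from the newer document; the
extra length attribute and the <comment> tag are simply skipped.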
