[freenet-dev] Re: Index format proposal

Jerome Flesch Fri, 2 Jun 2006 20:40:35 +0200

> I recall Frost having a nasty bug that caused it to crash whenever it
> encountered a message malformed in a special way due to the parser not
> handling error cases correctly. Using XML allows one to use existing XML
> libraries for parsing instead of having to write a new parser, making it
> much less likely that such unpleasantness occurs again.
>
Er ... I haven't looked at the Frost source code, but Frost messages already 
seem to be in XML, no?


> This is especially 
> important for non-Java programs, since they can easily develop far more
> serious symptoms than simply crashing.
>
> It also allows for trivial backwards-compatible extension: simply state
> that a program should ignore all tags and attributes it doesn't
> understand, and you can extend the format as needed while the old programs
> will still keep on working.
>
Yes, but I'm always afraid that one day I'll see strange / useless extensions 
from programs written by someone else, and people asking me to add support for 
them ... (OK, it probably won't happen ...)
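
To make the "ignore what you don't understand" rule concrete, here is a rough
sketch of a tolerant reader using a stock XML library. The tag and attribute
names (file, key, size, title, mime) are made up for the example; nothing here
is an agreed format.

```python
import xml.etree.ElementTree as ET

# Hypothetical names for illustration only; the real index format is not settled.
KNOWN_ATTRS = {"key", "size"}
KNOWN_CHILDREN = {"title", "mime"}

def parse_file_entry(xml_text):
    """Return the fields this reader understands and skip everything else."""
    # ET.fromstring raises ET.ParseError on malformed input instead of
    # leaving the caller with half-parsed data.
    root = ET.fromstring(xml_text)
    entry = {k: v for k, v in root.attrib.items() if k in KNOWN_ATTRS}
    for child in root:
        if child.tag in KNOWN_CHILDREN:
            entry[child.tag] = (child.text or "").strip()
        # Unknown tags are skipped silently, so a newer writer does not break us.
    return entry

sample = """<file key="CHK@example" size="1024" rating="5">
  <title>Some document</title>
  <mime>text/plain</mime>
  <activelink>ignored by this reader</activelink>
</file>"""

entry = parse_file_entry(sample)
```

The unknown rating attribute and activelink tag are simply dropped, so an old
reader keeps working whatever extensions someone else invents.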


>
> On Fri, 02 Jun 2006 00:33:23 +0100, Matthew Toseland wrote:
> > Firstly, why do we need two index formats? I'm the first to admit that
> > the current Librarian index format is limited - way too limited - but why
> > do we need two? The main changes I would make to the librarian format
> > right now would be:
> > - Support splitting. (This is relevant to file indexes)
> > - Include word indexes to allow for adjacent word searches. (This is
> >   relevant to file indexes too, because you may want to search for
> >   adjacent words in a title).
> > - Maybe include some amount of metadata - functional (mime type), or
> >   theoretical (category, dublin core...), or other (activelinks?). (This
> >   is definitely relevant to file indexes).
> > - Include the filename in the index. Possibly using negative word
> >   indexes to indicate "in the filename" words; it must be possible to
> >   distinguish between matches in the page title and matches in the
> >   content. (This is also relevant to both web page indexes and file
> >   indexes, though especially to the latter).
> >
> > I am quite happy to change the format. Indeed it needs significant
> > changes.
> >
> > Indexes, like all files, are automatically compressed, so don't worry too
> > much about it being overly verbose.
> >
> > Now, you are proposing additional fields: firstly, the size of the
> > content (this isn't especially relevant to web page indexes), and the
> > length of the file if it is audio or video. Both are perfectly reasonable
> > extensions IMHO. If we are going to support metadata we should support a
> > range of metadata; we will need support for a category, (probably tied to
> > a specific site), at least, and this is a very woolly and arbitrary
> > thing.
> >
> > An explicit aim of your index format is to be able to index the contents
> > of text-based files by words. This is a good thing, but if you are going
> > to do that, then you should define a format, (preferably with some of the
> > details of splitting indexes worked out), and make Librarian and Spider
> > use it. Metadata can be shown next to matches, or it can be used to
> > narrow down searches.
> >
> > And I honestly don't care whether it is XML. I see no reason to take
> > strenuous efforts to keep back compatibility, but filters can be written
> > easily enough if need be.
>
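
The word-index idea quoted above (a position for each word, with negative
positions marking words from the filename) could look roughly like the sketch
below. The in-memory layout, the function names and the sample keys are all
assumptions for illustration, not the Librarian format.

```python
from collections import defaultdict

def build_index(docs):
    """docs maps a key to (filename_words, content_words).

    Content words get positions 1, 2, 3, ...; filename words get -1, -2, -3, ...
    so a reader can tell the two kinds of match apart by the sign.
    """
    index = defaultdict(lambda: defaultdict(list))
    for key, (name_words, content_words) in docs.items():
        for i, word in enumerate(content_words, start=1):
            index[word.lower()][key].append(i)
        for i, word in enumerate(name_words, start=1):
            index[word.lower()][key].append(-i)
    return index

def adjacent_matches(index, first, second):
    """Return {key: positions of `first`} where `second` directly follows it."""
    hits = {}
    a = index.get(first.lower(), {})
    b = index.get(second.lower(), {})
    for key in a.keys() & b.keys():
        b_positions = set(b[key])
        # The next word is p + 1 in the content and p - 1 in the filename.
        found = [p for p in a[key] if (p + 1 if p > 0 else p - 1) in b_positions]
        if found:
            hits[key] = found
    return hits

docs = {
    "CHK@a": (["free", "software", "song.ogg"], ["this", "is", "free", "software"]),
    "CHK@b": (["notes.txt"], ["software", "that", "is", "free"]),
}
index = build_index(docs)
hits = adjacent_matches(index, "free", "software")
```

A positive position in the result is a match in the content and a negative one
is a match in the filename, so a search front end could rank or label them
differently.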

-- 
Jerome Flesch.
