[freenet-dev] Index format proposal

Matthew Toseland Fri, 2 Jun 2006 00:33:23 +0100

Firstly, why do we need two index formats? I'm the first to admit that
the current Librarian index format is limited - way too limited - but
why do we need two? The main changes I would make to the Librarian
format right now would be:
- Support splitting. (This is relevant to file indexes)
- Include word indexes (the position of each word in the document) to
  allow for adjacent word searches. (This is relevant to file indexes
  too, because you may want to search for adjacent words in a title).
- Maybe include some amount of metadata - functional (mime type), or
  theoretical (category, dublin core...), or other (activelinks?). (This
  is definitely relevant to file indexes).
- Include the filename in the index. Possibly using negative word
  indexes to indicate "in the filename" words; it must be possible to
  distinguish between matches in the page title and matches in the
  content. (This is also relevant to both web page indexes and file
  indexes, though especially to the latter).
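
To make the word-index idea concrete, here is a minimal sketch in Python.
It is not the Librarian format, only one way the scheme could work: it
assumes content words are numbered 1, 2, 3, ... and filename/title words
are numbered -1, -2, -3, ..., so the sign tells the two apart. The class
name, method names and the "CHK@..." keys are made up for illustration.

```python
from collections import defaultdict


class PositionalIndex:
    """Toy in-memory index: word -> {uri -> [word indexes]}.

    Assumption for this sketch: content words get indexes 1, 2, 3, ...
    and filename/title words get indexes -1, -2, -3, ...
    """

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, uri, title, content):
        # Negative indexes mark "in the filename/title" words.
        for i, word in enumerate(title.lower().split(), start=1):
            self.postings[word][uri].append(-i)
        # Positive indexes mark words in the content.
        for i, word in enumerate(content.lower().split(), start=1):
            self.postings[word][uri].append(i)

    def adjacent(self, first, second, in_title=False):
        """Return URIs where `second` immediately follows `first`."""
        # Title indexes count downwards, content indexes count upwards.
        step = -1 if in_title else 1
        a = self.postings.get(first.lower(), {})
        b = self.postings.get(second.lower(), {})
        hits = []
        for uri, positions in a.items():
            following = set(b.get(uri, []))
            for p in positions:
                same_zone = (p < 0) == in_title
                if same_zone and (p + step) in following:
                    hits.append(uri)
                    break
        return sorted(hits)


idx = PositionalIndex()
idx.add("CHK@a", "free software", "an index of free software tools")
idx.add("CHK@b", "music", "software free of charge")
print(idx.adjacent("free", "software"))                 # content matches
print(idx.adjacent("free", "software", in_title=True))  # title matches
```

Because the sign separates the two zones, a phrase can never match
across the boundary between the end of the title and the start of the
content.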

I am quite happy to change the format. Indeed it needs significant
changes.

Indexes, like all files, are automatically compressed, so don't worry
too much about it being overly verbose.

Now, you are proposing additional fields: the size of the content (this
isn't especially relevant to web page indexes), and the length (playing
time) of the file if it is audio or video. Both are perfectly
reasonable extensions IMHO. If we are going to support metadata we
should support a range of metadata; at the very least we will need
support for a category (probably tied to a specific site), and this is
a very woolly and arbitrary thing.
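
As a rough illustration of what per-entry metadata might look like, the
Python sketch below carries the fields discussed here (MIME type, size,
length for audio/video, and a free-form category) and uses them either
to filter a result list or to print next to a match. The field names,
the `narrow` and `describe` helpers and the "CHK@..." keys are
assumptions of this sketch, not part of any existing format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    uri: str
    mime_type: str
    size_bytes: Optional[int] = None     # size of the content
    duration_secs: Optional[int] = None  # only meaningful for audio/video
    category: Optional[str] = None       # site-specific, arbitrary label


def narrow(entries, mime_prefix=None, min_duration=None, category=None):
    """Keep only entries that match every filter that was supplied."""
    result = []
    for e in entries:
        if mime_prefix is not None and not e.mime_type.startswith(mime_prefix):
            continue
        if min_duration is not None and (
            e.duration_secs is None or e.duration_secs < min_duration
        ):
            continue
        if category is not None and e.category != category:
            continue
        result.append(e)
    return result


def describe(e):
    """Format an entry's metadata for display next to a search match."""
    parts = [e.uri, e.mime_type]
    if e.size_bytes is not None:
        parts.append(f"{e.size_bytes} bytes")
    if e.duration_secs is not None:
        parts.append(f"{e.duration_secs // 60}:{e.duration_secs % 60:02d}")
    if e.category:
        parts.append(f"[{e.category}]")
    return "  ".join(parts)


entries = [
    Entry("CHK@song", "audio/ogg", 4000000, 245, "music"),
    Entry("CHK@page", "text/html", 12000),
    Entry("CHK@clip", "video/mpeg", 900000, 30, "music"),
]
for e in narrow(entries, category="music"):
    print(describe(e))
```

Every field except the URI and MIME type is optional, which keeps the
same record usable for web pages (no length) and for media files.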

An explicit aim of your index format is to be able to index the contents
of text-based files by words. This is a good thing, but if you are going
to do that, then you should define a format (preferably with some of
the details of splitting indexes worked out) and make Librarian and
Spider use it. Metadata can be shown next to matches, or it can be used
to narrow down searches.
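
The details of splitting are still to be worked out, so the following
Python sketch is only one possible scheme, not a proposal anyone has
agreed on: each word is assigned to a sub-index by the first hex digit
of a hash of the word, so a searcher only has to fetch the one
sub-index that can contain the word. The "index_" naming, the choice of
SHA-256 and the helper names are all assumptions made for illustration.

```python
import hashlib


def subindex_name(word, prefix_len=1):
    """Pick the sub-index for a word from a prefix of its hash."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).hexdigest()
    return "index_" + digest[:prefix_len]


def split_index(postings, prefix_len=1):
    """Split {word: [uris]} into {subindex_name: {word: [uris]}}."""
    parts = {}
    for word, uris in postings.items():
        parts.setdefault(subindex_name(word, prefix_len), {})[word] = uris
    return parts


def lookup(parts, word, prefix_len=1):
    """Look a word up by consulting only the sub-index it hashes to."""
    return parts.get(subindex_name(word, prefix_len), {}).get(word, [])


postings = {
    "freenet": ["CHK@a"],
    "index": ["CHK@a", "CHK@b"],
    "spider": ["CHK@c"],
}
parts = split_index(postings)
print(lookup(parts, "index"))
```

With one hex digit there are at most 16 sub-indexes; a longer prefix
gives more, smaller pieces at the cost of more files to insert.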

And I honestly don't care whether it is XML. I see no reason to take
strenuous efforts to keep backward compatibility, but filters can be
written easily enough if need be.

On Fri, Jun 02, 2006 at 01:15:25AM +0200, Jerome Flesch wrote:
> Hello,
> 
> I designed an index format for the Fuqid replacement, and I wish to have your 
> opinion:
> 
> http://wiki.freenetproject.org/AnotherFreenetIndexFormat
> 
> 
> -- 
> Jerome Flesch.

-- 
Matthew J Toseland - toad at amphibian.dyndns.org
Freenet Project Official Codemonkey - http://freenetproject.org/
ICTHUS - Nothing is impossible. Our Boss says so.