> Firstly, why do we need two index formats? I'm the first to admit that
> the current Librarian index format is limited - way too limited - but
> why do we need two?
>
In fact, I didn't think you would agree to change Librarian ... evidently, I
was wrong :)


> The main changes I would make to the librarian 
> format right now would be:
> - Support splitting. (This is relevant to file indexes)
> - Include word indexes to allow for adjacent word searches. (This is
>   relevant to file indexes too, because you may want to search for
>   adjacent words in a title).
> - Maybe include some amount of metadata - functional (mime type), or
>   theoretical (category, dublin core...), or other (activelinks?). (This
>   is definitely relevant to file indexes).
> - Include the filename in the index. Possibly using negative word
>   indexes to indicate "in the filename" words; it must be possible to
>   distinguish between matches in the page title and matches in the
>   content. (This is also relevant to both web page indexes and file
>   indexes, though especially to the latter).
>
I will try to update my format proposal as soon as possible (probably this 
evening) to allow this.

Regarding adjacent-word searches, I assume a good solution would be to index
which documents each word appears in, and at which positions? (As suggested in
http://en.wikipedia.org/wiki/Inverted_index)
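
To make the idea concrete, here is a minimal sketch of a positional inverted
index in Python. The function names (build_index, adjacent_matches), the
sample documents and the whitespace tokenizer are assumptions for
illustration only; none of this is part of the format proposal.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        # Naive tokenizer (assumption): lowercase, split on whitespace.
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

def adjacent_matches(index, first, second):
    """Return sorted ids of documents where `second` directly follows `first`."""
    hits = []
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    for doc_id, positions in first_postings.items():
        following = set(second_postings.get(doc_id, []))
        # Adjacent means: `second` occurs at the position right after `first`.
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)

# Made-up sample documents.
docs = {
    "page1": "freenet index format proposal",
    "page2": "format of the freenet index",
    "page3": "index freenet",
}
index = build_index(docs)
print(adjacent_matches(index, "freenet", "index"))  # ['page1', 'page2']
```

Storing positions costs more space than a plain word-to-document mapping, but
since indexes are compressed anyway that may be acceptable.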


> I am quite happy to change the format. Indeed it needs significant
> changes.
>
> Indexes, like all files, are automatically compressed, so don't worry
> too much about it being overly verbose.
>
Ok


> Now, you are proposing additional fields: firstly, the size of the
> content (this isn't especially relevant to web page indexes),
>
Yes, it was initially designed mostly for binary files, but even for small
files like HTML files, I think it can be interesting to let the user foresee
how much he will download.


> and the 
> length of the file if it is audio or video. Both are perfectly
> reasonable extensions IMHO. If we are going to support metadata we 
> should support a range of metadata; we will need support for a category,
> (probably tied to a specific site), at least, and this is a very woolly
> and arbitrary thing.
>
I agree that it's a woolly and arbitrary thing, and I think most users won't
even spend time defining categories for their files (that's why I've made
this optional).
In fact, in the Fuqid replacement, I thought of letting the user define the
category of a given file himself. But to be usable, that would imply allowing
the user to change the category of many files at once.


> An explicit aim of your index format is to be able to index the contents
> of text-based files by words. This is a good thing, but if you are going
> to do that, then you should define a format, (preferably with some of
> the details of splitting indexes worked out), and make Librarian and
> Spider use it. 
>
Ok, so if my next format proposal suits everybody, I'll try to adapt
Librarian and Spider.

Regarding Spider, at first it would only be a basic version / adaptation,
only indexing HTML files. As I will need to create a set of filters to
extract metadata and words for the Fuqid replacement, I could reuse them
later in Spider.


> Metadata can be shown next to matches, or it can be used 
> to narrow down searches.
>
For Librarian? Ok, I don't think it will be a real problem.
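
As a rough sketch of the second use (narrowing a search with metadata),
assuming each match carries a small metadata dictionary. The field names
(mime, size), the sample entries and the narrow helper are hypothetical and
not taken from any existing format:

```python
def narrow(matches, **required):
    """Keep only the matches whose metadata equals every required value."""
    return [
        m for m in matches
        if all(m["meta"].get(key) == value for key, value in required.items())
    ]

# Hypothetical search results; keys and metadata values are made up.
matches = [
    {"key": "doc-a", "meta": {"mime": "text/html", "size": 2048}},
    {"key": "doc-b", "meta": {"mime": "audio/ogg", "size": 3500000}},
    {"key": "doc-c", "meta": {"mime": "text/html", "size": 512}},
]
html_only = narrow(matches, mime="text/html")
print([m["key"] for m in html_only])  # ['doc-a', 'doc-c']
```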


> And I honestly don't care whether it is XML. I see no reason to take
> strenuous efforts to keep back compatibility, but filters can be written
> easily enough if need be.
>
So you suggest dropping the current format completely? In that case, I
presume it will be easier for me to write an adapted version of Librarian and
Spider :)


> On Fri, Jun 02, 2006 at 01:15:25AM +0200, Jerome Flesch wrote:
> > Hello,
> >
> > I designed an index format for the Fuqid replacement, and I wish to have
> > your opinion:
> >
> > http://wiki.freenetproject.org/AnotherFreenetIndexFormat
> >
> >
> > --
> > Jerome Flesch.
> > _______________________________________________
> > Devl mailing list
> > Devl@freenetproject.org
> > http://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl

-- 
Jerome Flesch.