> Firstly, why do we need two index formats? I'm the first to admit that
> the current Librarian index format is limited - way too limited - but
> why do we need two?

In fact, I didn't think you would agree to change Librarian... clearly, I was wrong :)

> The main changes I would make to the librarian
> format right now would be:
> - Support splitting. (This is relevant to file indexes)
> - Include word indexes to allow for adjacent word searches. (This is
>   relevant to file indexes too, because you may want to search for
>   adjacent words in a title).
> - Maybe include some amount of metadata - functional (mime type), or
>   theoretical (category, dublin core...), or other (activelinks?). (This
>   is definitely relevant to file indexes).
> - Include the filename in the index. Possibly using negative word
>   indexes to indicate "in the filename" words; it must be possible to
>   distinguish between matches in the page title and matches in the
>   content. (This is also relevant to both web page indexes and file
>   indexes, though especially to the latter).

I will try to update my format proposal as soon as possible (probably this evening) to allow this. Regarding adjacent word searches, I assume a good solution would be to index which documents each word appears in, and at which positions? (As suggested in http://en.wikipedia.org/wiki/Inverted_index ?)

> I am quite happy to change the format. Indeed it needs significant
> changes.
>
> Indexes, like all files, are automatically compressed, so don't worry
> too much about it being overly verbose.

Ok

> Now, you are proposing additional fields: firstly, the size of the
> content (this isn't especially relevant to web page indexes),

Yes, it was initially designed mostly for binary files, but even for small files like HTML files, I think it can be interesting to let the user foresee how much he will download.

> and the
> length of the file if it is audio or video. Both are perfectly
> reasonable extensions IMHO. If we are going to support metadata we
> should support a range of metadata; we will need support for a category,
> (probably tied to a specific site), at least, and this is a very woolly
> and arbitrary thing.
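
To make the adjacent-word idea above concrete, here is a minimal sketch of a positional inverted index in Python. It is only an illustration of the general technique described on the Wikipedia page, not the Librarian format or my proposal: the function names (tokenize, build_index, find_adjacent), the whitespace tokenizer and the sample documents are all assumptions made up for this example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(documents):
    """Map each word to {document id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word][doc_id].append(position)
    return index


def find_adjacent(index, first, second):
    """Return the sorted ids of documents where `second` directly follows `first`."""
    first_postings = index.get(first.lower(), {})
    second_postings = index.get(second.lower(), {})
    matches = []
    for doc_id, positions in first_postings.items():
        following = set(second_postings.get(doc_id, ()))
        # Adjacent means: some position p of `first` has `second` at p + 1.
        if any(p + 1 in following for p in positions):
            matches.append(doc_id)
    return sorted(matches)


# Hypothetical sample data, only for illustration.
docs = {
    "doc1": "the freenet index format",
    "doc2": "index the format of freenet",
}
index = build_index(docs)
print(find_adjacent(index, "index", "format"))
```

Both sample documents contain "index" and "format", but only doc1 has them next to each other, so a plain word-to-document index could not tell the two apart; the stored positions are what make the adjacent search possible.
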

I agree that it's a woolly and arbitrary thing, and I think most users won't even spend time defining categories for their files (that's why I made this optional). In fact, in the Fuqid replacement, I was thinking of letting the user define the category of a given file himself. But to be usable, this would require allowing the user to change the category of many files at once.

> An explicit aim of your index format is to be able to index the contents
> of text-based files by words. This is a good thing, but if you are going
> to do that, then you should define a format, (preferably with some of
> the details of splitting indexes worked out), and make Librarian and
> Spider use it.

Ok, so if my next format proposal suits everybody, I'll try to adapt Librarian and Spider. Regarding Spider, at first it would only be a basic version / adaptation, indexing only HTML files. As I will need to create a set of filters to extract metadata and words for the Fuqid replacement, I could reuse them later in Spider.

> Metadata can be shown next to matches, or it can be used
> to narrow down searches.

For Librarian? Ok, I don't think it will be a real problem.

> And I honestly don't care whether it is XML. I see no reason to take
> strenuous efforts to keep back compatibility, but filters can be written
> easily enough if need be.

So you suggest dropping the current format completely? In that case, I presume it will be easier for me to write an adapted version of Librarian and Spider :)

> On Fri, Jun 02, 2006 at 01:15:25AM +0200, Jerome Flesch wrote:
> > Hello,
> >
> > I designed an index format for the Fuqid replacement, and I wish to have
> > your opinion:
> >
> > http://wiki.freenetproject.org/AnotherFreenetIndexFormat
> >
> >
> > --
> > Jerome Flesch.
> > _______________________________________________
> > Devl mailing list
> > Devl@freenetproject.org
> > http://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl

--
Jerome Flesch.