[freenet-dev] Index format proposal

Matthew Toseland Fri, 2 Jun 2006 19:47:57 +0100

On Fri, Jun 02, 2006 at 08:25:39PM +0200, Jerome Flesch wrote:
> > Firstly, why do we need two index formats? I'm the first to admit that
> > the current Librarian index format is limited - way too limited - but
> > why do we need two?
> >
> In fact, I didn't think you would agree to change librarian ... visibly, I
> was wrong :)


:)
> 
> 
> > The main changes I would make to the librarian 
> > format right now would be:
> > - Support splitting. (This is relevant to file indexes)
> > - Include word indexes to allow for adjacent word searches. (This is
> >   relevant to file indexes too, because you may want to search for
> >   adjacent words in a title).
> > - Maybe include some amount of metadata - functional (mime type), or
> >   theoretical (category, dublin core...), or other (activelinks?). (This
> >   is definitely relevant to file indexes).
> > - Include the filename in the index. Possibly using negative word
> >   indexes to indicate "in the filename" words; it must be possible to
> >   distinguish between matches in the page title and matches in the
> >   content. (This is also relevant to both web page indexes and file
> >   indexes, though especially to the latter).
> >
> I will try to update my format proposal as soon as possible (probably this 
> evening) to allow this.

I'm sorry, the above was incomprehensible because of an unforeseen
double entendre on "word indexes". The next version of the librarian
index format was to be something like this:

word 32 (17,23,99) 33 (11,-2)

I.e. the word occurs in URIs number 32 and 33. Each of these has a list
of integers. The integers are the index, or less confusingly position, of
the word, within the stream of words that is the document. This is a
counter - the first word is 0, the second word is 1 etc. (Excluding any
non-text content e.g. html tags). We would use a negative number to
indicate that the word was not in the content but in the title. (That's
not implemented in Librarian, I just came up with it).
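
To make the line format concrete, here is a rough Python sketch of a
parser for one such line. It is only an illustration of the proposal
above, not Librarian code: the function name parse_index_line and the
regular expression are my own assumptions, and it assumes exactly the
layout of the example (a word, then repeated "URI-number (positions)"
groups).

```python
import re

# Hypothetical helper, not part of Librarian: matches "32 (17,23,99)".
ENTRY = re.compile(r"(\d+)\s*\(([^)]*)\)")

def parse_index_line(line):
    """Split one index line into (word, {uri_number: [positions]})."""
    word, _, rest = line.strip().partition(" ")
    postings = {}
    for uri_number, positions in ENTRY.findall(rest):
        # Negative integers are kept as-is; under the proposal they
        # mark occurrences in the title rather than in the content.
        postings[int(uri_number)] = [
            int(p) for p in positions.split(",") if p.strip()
        ]
    return word, postings

word, postings = parse_index_line("word 32 (17,23,99) 33 (11,-2)")
# postings == {32: [17, 23, 99], 33: [11, -2]}
```

A reader of the index could then test "p < 0" on each position to tell
title matches from content matches.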

Does this make the second and last points above clearer?

> Just regarding finding adjacent words, I assume a good solution would be to 
> index which documents each word is in, and at which positions? (As suggested in 
> http://en.wikipedia.org/wiki/Inverted_index ?)

Right.
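
As a sketch of how a positional index gives adjacent word searches: take
the position lists of the two words, and a URI matches if some position
of the second word is exactly one more than a position of the first.
The Python below is only an assumption about how a client might do it
(the name adjacent_matches and the dict layout are hypothetical). It
looks at content positions only, since how negative title positions
would be ordered has not been defined yet.

```python
def adjacent_matches(first, second):
    """Return URI numbers where `second` directly follows `first`.

    Both arguments map a URI number to the list of positions at which
    the word occurs in that document.
    """
    hits = []
    # Only URIs that contain both words can match.
    for uri_number in sorted(first.keys() & second.keys()):
        following = set(second[uri_number])
        # Skip negative (title) positions; compare content positions only.
        if any(p >= 0 and p + 1 in following for p in first[uri_number]):
            hits.append(uri_number)
    return hits

free = {32: [17, 23], 33: [11]}
net = {32: [18], 33: [40]}
print(adjacent_matches(free, net))  # prints [32]
```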
> 
> > I am quite happy to change the format. Indeed it needs significant
> > changes.
> >
> > Indexes, like all files, are automatically compressed, so don't worry
> > too much about it being overly verbose.
> >
> Ok
> 
> > Now, you are proposing additional fields: firstly, the size of the
> > content (this isn't especially relevant to web page indexes),
> >
> Yes, it was initially designed mostly for binary files, but even for small 
> files like HTML files, I think it can be interesting to let the user foresee 
> how much he will download.

Yes, it's harmless enough.
> 
> > and the 
> > length of the file if it is audio or video. Both are perfectly
> > reasonable extensions IMHO. If we are going to support metadata we 
> > should support a range of metadata; we will need support for a category,
> > (probably tied to a specific site), at least, and this is a very woolly
> > and arbitrary thing.
> >
> I agree that it's a woolly and arbitrary thing, and I think most users 
> won't even spend time defining categories for their files (that's why I've 
> made this optional).
> In fact, in the Fuqid replacement, I thought of letting the user define the 
> category for a given file himself. But for better usability, that would imply 
> allowing the user to change the category of many files at the same time.

Well, for freesites, I expect categories to be quite important - but the
category would likely be assigned by a trusted author, such as TFE, or
at least it would be using a standard scale... or it could just be a
small amount of free text included by the site owner himself. But I know
CofE was thinking along the lines of providing a database of sites with
his own descriptions for them which could then be aggregated...
> 
> > An explicit aim of your index format is to be able to index the contents
> > of text-based files by words. This is a good thing, but if you are going
> > to do that, then you should define a format, (preferably with some of
> > the details of splitting indexes worked out), and make Librarian and
> > Spider use it. 
> >
> Ok, so if my next format proposal is right for everybody, I'll try to adapt 
> Librarian and Spider.

Right. Thanks for your thoroughness, I hope that it doesn't result in
your not having time to ship the primary finished product (the GUI
searching/sharing tool itself).
> 
> Regarding Spider, at first it would only be a basic version / 
> adaptation, only indexing HTML files. As I will need to create a set of 
> filters to extract metadata and words for the Fuqid replacement, I could 
> reuse them later in Spider.

Right. I see no reason why your filesharing tool cannot link directly
into freesites if they haven't been excluded from the search.
> 
> > Metadata can be shown next to matches, or it can be used 
> > to narrow down searches.
> >
> For Librarian ? Ok, I don't think it will be a real problem.

Yes, for google-style searches. It might be worth thinking about for
filesharing type searches too.
> 
> > And I honestly don't care whether it is XML. I see no reason to take
> > strenuous efforts to keep back compatibility, but filters can be written
> > easily enough if need be.
> >
> So you suggest dropping the current format completely? In that case, I 
> presume it will be easier for me to write an adapted version of Librarian and 
> Spider :)

:) I have no problem with that, and cyberdo hasn't been around much
lately (he wrote librarian IIRC).
> 
> 
> > On Fri, Jun 02, 2006 at 01:15:25AM +0200, Jerome Flesch wrote:
> > > Hello,
> > >
> > > I designed an index format for the Fuqid replacement, and I wish to have
> > > your opinion:
> > >
> > > http://wiki.freenetproject.org/AnotherFreenetIndexFormat
> > >
> > >
> > > --
> > > Jerome Flesch.
> > > _______________________________________________
> > > Devl mailing list
> > > Devl at freenetproject.org
> > > http://emu.freenetproject.org/cgi-bin/mailman/listinfo/devl
> 
> -- 
> Jerome Flesch.
> 

-- 
Matthew J Toseland - toad at amphibian.dyndns.org
Freenet Project Official Codemonkey - http://freenetproject.org/
ICTHUS - Nothing is impossible. Our Boss says so.