Andreas Hartmann wrote:
I volunteered to take the lead in the implementation of the "tag cloud" feature (see [1]).

I added a first version to the contributions area. The tag cloud is visible in the defaultfiredocs publication.


Some initial ideas:

IMO it makes sense to use the Dublin Core element "subject" to assign tags to a document [2].

I hard-coded this element for the moment.

Definition: "The topic of the resource."
Comment: "Typically, the subject will be represented using keywords, key phrases, or classification codes. Recommended best practice is to use a controlled vocabulary. To describe the spatial or temporal topic of the resource, use the Coverage element."

I guess this can be made configurable; we could just use the DC subject as the default. Since tags can contain spaces, we should use multiple meta data values to store multiple tags. A nice GUI for this has to be implemented. Would it be sufficient to extend the standard meta data GUI to allow entering multiple values, or do we need a dedicated tag management GUI? I'd suggest starting with the existing meta data GUI.

I haven't taken care of the multi-value handling yet. The tags are just the terms which are indexed by Lucene. TODO: Define the meta element values as keyword index fields instead of text fields to support phrases (multi-word terms).

Finding all documents with a certain tag is rather simple since all meta data are indexed.

I used the standard search for this purpose. I had to extend the lucene module sitemap with a "raw" query type. The query looks like this:

  \{http\://purl.org/dc/elements/1.1/\}subject:foobar

It's rather ugly that this appears in the search box. Maybe we have to add a dedicated meta data search or use another concept for special search terms.
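
To illustrate how the raw query shown above could be assembled, here is a minimal sketch in plain Java. It assumes the index field name follows the "{namespace}element" pattern visible in the query and escapes the characters that are special in the Lucene query syntax by hand. The class and method names (TagQuery, escape, forTag) are hypothetical and not part of the lucene module; multi-word tags are not handled, since they depend on the keyword-field TODO mentioned earlier.

```java
final class TagQuery {

    // Characters with a special meaning in the classic Lucene query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    /** Prefixes every special character with a backslash. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    /** Builds a raw "field:term" query for a single-word tag (assumed field naming). */
    static String forTag(String namespace, String element, String tag) {
        String field = "{" + namespace + "}" + element;
        return escape(field) + ":" + escape(tag);
    }

    public static void main(String[] args) {
        System.out.println(forTag("http://purl.org/dc/elements/1.1/", "subject", "foobar"));
    }
}
```

Generating the query on the server side like this would at least keep users from having to type the escaped form themselves, although it does not solve the problem of the query showing up in the search box.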


The real challenge is to generate a list of all existing tags.

Maybe there is a performant way to generate the cloud using the index, e.g. via a wildcard query. But this still needs some postprocessing, so we'll probably have to cache the tag cloud.

Lucene allows enumerating all terms for a particular field. To filter by language, I had to add a loop which searches the index for each term. I guess this takes quite a lot of time. Maybe someone knows a better solution? Or maybe a new version of Lucene has a more flexible API for term enumeration?

If you omit the language parameter of the IndexTermsGenerator, the language filtering is skipped and the listing of the terms is probably pretty fast.


If Lucene doesn't help, we have another nifty feature for this purpose: the RepositoryListener interface. By registering a listener with the repository, we can extract the tags of a document when it is saved, and update the tag cloud accordingly. The cloud also has to be updated when a document is removed. The details are a bit tricky (concurrency, queuing), but I think there's nothing that can't be solved. In this case we have to store the tag cloud. My first idea would be to use a dedicated document for this purpose.
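
The bookkeeping behind the listener idea could look roughly like the following sketch. It is plain Java and does not use the actual RepositoryListener interface; the class TagCloudIndex and its methods are hypothetical, and the caller is assumed to know the tags a document had before and after the save. Each counter update is atomic, but a save as a whole is not, so the queuing and concurrency questions mentioned above remain open.

```java
import java.util.Collection;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

final class TagCloudIndex {

    // tag -> number of documents currently carrying that tag
    private final Map<String, Integer> counts = new ConcurrentHashMap<>();

    /** To be called when a document is saved; oldTags is empty for a new document. */
    void documentSaved(Collection<String> oldTags, Collection<String> newTags) {
        documentRemoved(oldTags);
        for (String tag : newTags) {
            counts.merge(tag, 1, Integer::sum);
        }
    }

    /** To be called when a document is removed. */
    void documentRemoved(Collection<String> tags) {
        for (String tag : tags) {
            // Returning null drops the entry once the last document is gone.
            counts.computeIfPresent(tag, (t, n) -> n > 1 ? n - 1 : null);
        }
    }

    /** Sorted copy of the current counts, e.g. for rendering the cloud. */
    Map<String, Integer> snapshot() {
        return new TreeMap<>(counts);
    }
}
```

The snapshot would still have to be persisted somewhere, for instance in the dedicated document mentioned above.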

I'd prefer the dynamic generation using Lucene, though, because otherwise we store redundant information in the repository which always carries a certain risk.

I think we can use Lucene. There's no need for the repository listener.

Another issue is supporting the user when she enters the tags. The system should present a list of existing tags, possibly with some kind of autocomplete functionality. But I guess when we manage to generate the cloud, this feature can easily be added.

I haven't tackled this issue yet.
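
Once the list of existing tags is available, the autocomplete lookup itself is simple. A minimal sketch in plain Java, assuming the tags are held in memory and matched case-insensitively by prefix; the class TagSuggester and its methods are hypothetical and not part of the contributed code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

final class TagSuggester {

    // Sorted set, so all tags sharing a prefix are adjacent.
    private final TreeSet<String> tags = new TreeSet<>();

    void add(String tag) {
        tags.add(tag.toLowerCase(Locale.ROOT));
    }

    /** Returns up to 'limit' existing tags starting with the given prefix. */
    List<String> suggest(String prefix, int limit) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        // Start at the first tag >= prefix and stop at the first non-match.
        for (String tag : tags.tailSet(p, true)) {
            if (!tag.startsWith(p) || result.size() >= limit) {
                break;
            }
            result.add(tag);
        }
        return result;
    }
}
```

Because the tags are stored as whole strings, multi-word tags such as "tag cloud" are suggested as a single entry.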

Any comments and improvements are greatly appreciated!

-- Andreas


--
Andreas Hartmann, CTO
BeCompany GmbH
http://www.becompany.ch
Tel.: +41 (0) 43 818 57 01

