Re: Tag cloud

Andreas Hartmann Fri, 03 Jul 2009 14:29:26 -0700

Andreas Hartmann wrote:

I volunteered to take the lead in the implementation of the "tag cloud" feature (see [1]).

I added a first version to the contributions area. The tag cloud is visible in the defaultfiredocs publication.

Some initial ideas:
IMO it makes sense to use the Dublin Core element "subject" to assign tags to a document [2].


I hard-coded this element for the moment.

Definition: "The topic of the resource."
Comment: "Typically, the subject will be represented using keywords, key phrases, or classification codes. Recommended best practice is to use a controlled vocabulary. To describe the spatial or temporal topic of the resource, use the Coverage element."
I guess this can be made configurable; we could just use the DC subject as the default. Since tags can contain spaces, we should use multiple meta data values to store multiple tags. A nice GUI for this has to be implemented. Would it be sufficient to extend the standard meta data GUI to allow entering multiple values, or do we need a dedicated tag management GUI? I'd suggest starting with the existing meta data GUI.

I haven't taken care of the multi-value handling yet. The tags are just the terms which are indexed by Lucene. TODO: Define the meta element values as keyword index fields instead of text fields to support phrases (multi-word terms).
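To illustrate why text fields break multi-word tags, here is a minimal sketch in plain Java without any Lucene dependency. The class TagTerms and both method names are made up for this mail, and the whitespace split with lowercasing is only a crude stand-in for what an analyzer does:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or the tag cloud module.
final class TagTerms {

    // Stand-in for a tokenized text field: every word becomes its own term,
    // so the tag "open source" is indexed as "open" and "source".
    static List<String> asTextField(String value) {
        List<String> terms = new ArrayList<>();
        for (String word : value.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Stand-in for a keyword field: the whole value is kept as one term,
    // so a multi-word tag survives intact.
    static List<String> asKeywordField(String value) {
        List<String> terms = new ArrayList<>();
        terms.add(value.trim());
        return terms;
    }
}
```

With the text-field behaviour, a tag such as "open source" would show up in the cloud as two unrelated tags, which is what the TODO is meant to avoid.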

Finding all documents with a certain tag is rather simple since all metadata are indexed.

I used the standard search for this purpose. I had to extend the lucene module sitemap with a "raw" query type. The query looks like this:


  \{http\://purl.org/dc/elements/1.1/\}subject:foobar

It's rather ugly that this appears in the search box. Maybe we have to add a dedicated meta data search or use another concept for special search terms.
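For reference, a query of this shape could be assembled roughly as follows. This is only a sketch: RawMetaQuery and its methods are hypothetical names, and the escaped character set is just what is needed to reproduce the query above, not the full list of query parser special characters (a real implementation would probably rely on the parser's own escaping, and would have to quote values containing spaces):

```java
// Hypothetical helper for building a "raw" meta data query string.
final class RawMetaQuery {

    // Escapes only backslash, braces and colon. This is an assumption made
    // for the sketch, not the complete query parser escaping rules.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '\\' || c == '{' || c == '}' || c == ':') {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    // Builds "{namespace}element:value" with the field name escaped,
    // matching the shape of the query shown above.
    static String forElement(String namespace, String element, String value) {
        String field = "{" + namespace + "}" + element;
        return escape(field) + ":" + escape(value);
    }
}
```

Calling forElement("http://purl.org/dc/elements/1.1/", "subject", "foobar") yields the query string quoted above.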

The real challenge is to generate a list of all existing tags.
Maybe there is an efficient way to generate the cloud using the index, e.g. via a wildcard query. But this still needs some postprocessing, so we'll probably have to cache the tag cloud.
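One piece of that postprocessing is turning raw tag counts into display sizes. The thread does not say how this should be done, so the following is only an assumption: a linear mapping of counts onto a fixed number of size classes, with CloudScaler as a made-up name:

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical postprocessing step: map tag counts to size classes 1..classes.
final class CloudScaler {

    static SortedMap<String, Integer> sizeClasses(Map<String, Integer> counts, int classes) {
        if (classes < 1) {
            throw new IllegalArgumentException("classes must be >= 1");
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        if (counts.isEmpty()) {
            return result;
        }
        int min = Collections.min(counts.values());
        int max = Collections.max(counts.values());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Linear interpolation with integer division; if all tags have
            // the same count, everything falls into class 1.
            int size = (max == min)
                    ? 1
                    : 1 + (e.getValue() - min) * (classes - 1) / (max - min);
            result.put(e.getKey(), size);
        }
        return result;
    }
}
```

The result is small and changes only when documents are saved or removed, which is why caching it seems reasonable.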

Lucene allows enumerating all terms for a particular field. To filter by language, I had to add a loop which searches the index for each term. I guess this takes quite a lot of time. Maybe someone knows a better solution? Or maybe a new version of Lucene has a more flexible API for term enumeration?

If you omit the language parameter of the IndexTermsGenerator, the language filtering is skipped and the listing of the terms is probably pretty fast.
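To make the cost of that loop visible, here is a small in-memory model of it. It is not the IndexTermsGenerator code and uses no Lucene classes; IndexedDoc and TagCounter are hypothetical stand-ins, with a plain list playing the role of the index:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for an indexed document with a language and tags.
final class IndexedDoc {
    final String language;
    final Set<String> tags;

    IndexedDoc(String language, String... tags) {
        this.language = language;
        this.tags = new HashSet<>(Arrays.asList(tags));
    }
}

final class TagCounter {

    // Step 1: enumerate all distinct terms of the tag field.
    static SortedSet<String> allTerms(List<IndexedDoc> index) {
        SortedSet<String> terms = new TreeSet<>();
        for (IndexedDoc doc : index) {
            terms.addAll(doc.tags);
        }
        return terms;
    }

    // Step 2: one "search" per term, restricted to the given language.
    // A null language skips the filter. Terms without hits are omitted.
    static SortedMap<String, Integer> countTags(List<IndexedDoc> index, String language) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (String term : allTerms(index)) {
            int hits = 0;
            for (IndexedDoc doc : index) {
                boolean languageMatches = language == null || language.equals(doc.language);
                if (languageMatches && doc.tags.contains(term)) {
                    hits++;
                }
            }
            if (hits > 0) {
                counts.put(term, hits);
            }
        }
        return counts;
    }
}
```

In this model the work grows with the number of terms times the cost of one search, which matches my impression that the per-term loop is the slow part.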

If Lucene doesn't help, we have another nifty feature for this purpose: the RepositoryListener interface. By registering a listener with the repository, we can extract the tags of a document when it is saved, and update the tag cloud accordingly. The cloud also has to be updated when a document is removed. The details are a bit tricky (concurrency, queuing), but I think there's nothing that can't be solved. In this case we have to store the tag cloud. My first idea would be to use a dedicated document for this purpose.
I'd prefer the dynamic generation using Lucene, though, because otherwise we store redundant information in the repository which always carries a certain risk.
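For comparison, the listener-based bookkeeping could look roughly like this. The sketch deliberately does not use the real RepositoryListener signature; TagCloudStore and its method names are hypothetical, the counts live only in memory, and plain synchronization stands in for the concurrency and queuing details mentioned above:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory tag cloud that is updated on save and remove events.
final class TagCloudStore {

    // Remember each document's tags so a re-save can undo the old ones.
    private final Map<String, Set<String>> tagsByDocument = new HashMap<>();
    private final Map<String, Integer> counts = new TreeMap<>();

    // Would be called from a listener when a document is saved.
    synchronized void documentSaved(String documentId, Collection<String> tags) {
        documentRemoved(documentId); // drop the previous tags, if any
        Set<String> copy = new HashSet<>(tags);
        tagsByDocument.put(documentId, copy);
        for (String tag : copy) {
            counts.merge(tag, 1, Integer::sum);
        }
    }

    // Would be called from a listener when a document is removed.
    synchronized void documentRemoved(String documentId) {
        Set<String> old = tagsByDocument.remove(documentId);
        if (old == null) {
            return;
        }
        for (String tag : old) {
            // Returning null removes the entry once the count drops to zero.
            counts.computeIfPresent(tag, (t, n) -> n > 1 ? n - 1 : null);
        }
    }

    synchronized Map<String, Integer> snapshot() {
        return new TreeMap<>(counts);
    }
}
```

The per-document map is exactly the kind of redundant state that has to be kept in sync with the repository, which is the risk mentioned above.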


I think we can use Lucene. No need for the repository listening.

Another issue is supporting the user when she enters the tags. The system should present a list of existing tags, possibly with some kind of autocomplete functionality. But I guess when we manage to generate the cloud, this feature can easily be added.
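Once the list of existing tags is available, the lookup side of an autocomplete is simple. A minimal sketch, assuming the tags fit in memory and matching is case-sensitive (TagSuggester is a hypothetical name, and no GUI integration is shown):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical prefix lookup over the set of existing tags.
final class TagSuggester {

    private final NavigableSet<String> tags = new TreeSet<>();

    TagSuggester(Collection<String> existingTags) {
        tags.addAll(existingTags);
    }

    // Returns up to 'limit' tags starting with the prefix, in sorted order.
    List<String> suggest(String prefix, int limit) {
        List<String> result = new ArrayList<>();
        // In a sorted set, all matches follow directly after the prefix.
        for (String tag : tags.tailSet(prefix, true)) {
            if (!tag.startsWith(prefix) || result.size() >= limit) {
                break;
            }
            result.add(tag);
        }
        return result;
    }
}
```

The same sorted term list that feeds the cloud could feed this lookup, so no separate data source should be needed.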


I didn't tackle this issue yet.

Any comments and improvements are greatly appreciated!

-- Andreas


--
Andreas Hartmann, CTO
BeCompany GmbH
http://www.becompany.ch
Tel.: +41 (0) 43 818 57 01

