Re: simple search api (was Re: mimetype standardisation by testsets)

Mikkel Kamstrup Erlandsen Thu, 23 Nov 2006 07:26:42 -0800

2006/11/22, Magnus Bergman <[EMAIL PROTECTED]>:

I have constructed an in-house application which does pretty much
exactly what you describe (it doesn't yet speak dbus, but corba and
soap). Sadly I'm not allowed to release the source of this application,
but at least I can share some of my experience. (I haven't yet looked
closely at your source, so I might have misunderstood some things.)



Great! To paraphrase Linus: "Given enough eyeballs, all <strike>bugs</strike>
specs are shallow" :-)


If several search engines are available, the search manager lets the
client know of each search engine according to your proposal (right?).
I think it would be a better idea to present a list of indexes (of which
each search engine might provide several) to search in, but by default
search in all of them (if appropriate). I



Well, the search engines are not obliged to use a particular index format.
The indexes themselves can be of any format.


Instead of registering the search engine I think it's better to think
in terms of creating a session (which might still do exactly the same
thing), because this should affect all appropriate search engines
transparently, and because it might be desirable to alter some options
for the session (language, fuzziness, search contexts and such).



So you have a search-manager-daemon or something that holds a session object
with user info; do I understand correctly?

In addition to this session object I have found it suitable to also
have a search object (created from a query), because applications might
construct very complicated queries. This object can then be passed
to countHits and used for getting the hits, and also for getting
attributes of a hit (matching document, score, language and such).
(Note that a hit is not equivalent to a document.)



The problem with creating query objects like this is that we are creating a
dbus api. Essentially you only have simple data types at hand. No
objects - especially not objects with methods on them :-) It would be possible
to create a helper lib in <insert favorite language + toolkit> to construct
queries conforming to the wasabi spec, but this would require separate libs
for gobject and qt. While this is by no means ruled out, I think we had better
focus on the "bare" dbus api for now.
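
To illustrate the constraint: over dbus a query has to travel as plain
values (strings, integers, arrays, dicts), so whatever a query object
would hold has to be flattened first. A minimal sketch in Python with no
dbus calls at all; the function name build_query and the option keys are
made up for illustration and are not taken from any wasabi draft:

```python
def build_query(terms, language=None, max_hits=10):
    """Flatten a query into (string, dict of simple values).

    Hypothetical helper: both the name and the option keys are
    assumptions made for this example only.
    """
    if not terms:
        raise ValueError("a query needs at least one term")
    # The query itself becomes a single string.
    query_string = " ".join(terms)
    # Options become a flat dict holding only strings and integers.
    options = {"max_hits": int(max_hits)}
    if language is not None:
        options["language"] = str(language)
    return query_string, options
```

Everything returned here could be marshalled by any dbus binding without
a toolkit-specific helper lib, which is the point of keeping the api
"bare".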

Daemon or no daemon, that is the question. This is a question that
without doubt will arise (it always does). First we need to clarify that
there is a difference between a daemon doing the indexing of documents
(or rather detecting new documents that need to be indexed) and a daemon
performing the search (and possibly merging several searches). Most
search engines I use don't have a daemon for doing the searches
(instead they only provide a library), because that is seldom considered
necessary. Indexes are read only (when searching), so the common problems
daemons are used to solve are not present.



The situation at hand is that we have a handful of desktop search engines,
all implemented as daemons, handling both searches and indexing. Having an
extra daemon on top of that, handling the query one extra time before passing
it to the search subsystem, seems like overkill... Ideally I see the daemon/lib
(or even executable) being used only as a means of obtaining a dbus object
path given a dbus interface name ("org.freedesktop.search.simple").
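
Roughly, that lookup amounts to nothing more than a table from interface
name to provider. A sketch in plain Python (no dbus involved); the
interface name is the one above, while the bus names, object paths and
the lookup function are hypothetical placeholders:

```python
# Hypothetical registry: interface name -> list of (bus name, object path).
# The engine names and paths below are invented for this example.
REGISTRY = {
    "org.freedesktop.search.simple": [
        ("org.example.EngineA", "/org/example/EngineA/Search"),
        ("org.example.EngineB", "/org/example/EngineB/Search"),
    ],
}


def lookup(interface, registry=REGISTRY):
    """Return the first (bus_name, object_path) registered for interface."""
    providers = registry.get(interface, [])
    if not providers:
        raise LookupError("no provider for " + interface)
    return providers[0]
```

Once a client has the pair, it talks to the search engine directly; no
query passes through the lookup component.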

As you point out, having a separate daemon other than the indexer is not
exactly standard (at least not to my knowledge). Also, a managing daemon is
likely to re-invent functionality dbus already provides IMHO.

My solution (which took me quite a while to develop) might seem overly
complicated at first, but I think it really isn't. It was to implement
all functionality (including caching and merging of searches) in a
library. That library can be used by an application to do everything.
Or the application can use it just to contact a daemon (which of course
also uses the very same library for everything it does). This also has
the nice side effect that daemons can be chained, so searches can span
several computers (provided the library supports at least one
network-transparent communication mechanism). I think it would also be a
good idea for the library to support plugins for different search
engines/communication mechanisms.
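
The library described above is not public, so the following is only a
guess at what "merging of searches" could look like, not the actual
implementation. It assumes each backend returns hits as (score, uri)
tuples already sorted by descending score, and that scores from
different backends are comparable, which in practice they often are
not. The name merge_hits is hypothetical:

```python
import heapq
from itertools import islice


def merge_hits(result_lists, limit=10):
    """Merge per-engine hit lists into one list ordered by score.

    Each input list must already be sorted by descending score.
    Hits are (score, uri) tuples; this layout is an assumption.
    """
    # heapq.merge expects ascending order, so sort on the negated score.
    merged = heapq.merge(*result_lists, key=lambda hit: -hit[0])
    # Only the first `limit` hits are pulled from the lazy merge.
    return list(islice(merged, limit))
```

Because the merge is lazy, asking for the top few hits does not require
reading every backend's full result list.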



This is exactly what Wasabi aims to fix: standardizing apis across search
engine implementations. What functionality should sit on top of this, in
the form of helper libraries/daemons, should probably be punted for now
(at least until the api spec is set in stone).

One of the plugins is the one using the dbus search interface. Other
plugins could be made for existing search engines like
Lucene, Swish(++|E), mnoGoSearch, Xapian, ht://Dig, Datapark,
(hyper)estraier, Glimpse, Namazu, Sherlock Holmes and all the others,
which would surely be a lot easier than convincing each of them to
implement a daemon which provides a dbus interface.



Well, what you are suggesting sounds like the opposite of the current goal.
If I understand right you suggest creating a wrapper lib for each possible
search backend, as opposed to the current idea - to promote a shared dbus
interface. I see it this way: Dbus is the de facto standard for desktop ipc.
It is actually really easy (and portable between toolkits) to expose a dbus
api. Implementing a backend for each (custom) communications api sounds like
a great deal more work, with possibility for more bugs...

It's only guesswork, but I would bet that it is hard work maintaining a
cross-platform library doing all this. If we restricted ourselves to one
platform it would be a different matter.

One thing that English users seldom consider is the use of several
languages. Which language is being used is important to know in order
to decide what stemming rules to use, and which stop-words to use (in
English "the" is a stop-word, while in Swedish it means tea and is
something that is reasonable to search for). People using other languages
are very often multilingual (using English as well). Therefore it is
interesting to know which language the query is in (search engines
might also be able to translate queries to search in documents written
in different languages).


This is a good point. However I suggest leaving this up to the actual
implementations. After all, which stemmer to use is a question that is
settled at indexing time, when a document is indexed...
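
To make the stop-word example concrete, here is a small sketch of
language-dependent filtering. The word lists are tiny illustrative
samples, not real stop-word lists, and the names STOP_WORDS and
strip_stop_words are invented for this example:

```python
# Illustrative stop-word samples only; real lists are far longer.
STOP_WORDS = {
    "en": {"the", "a", "an", "of"},
    "sv": {"och", "en", "ett", "att"},
}


def strip_stop_words(query, language):
    """Drop the stop-words of the given language from a query string."""
    words = query.lower().split()
    # Unknown languages get no filtering rather than a wrong guess.
    stops = STOP_WORDS.get(language, set())
    return [word for word in words if word not in stops]
```

With these samples, "the" is removed from an English query but kept in
a Swedish one, so the same query string yields different search terms
depending on the language an implementation assumes.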

Sorry for the late reply. I have been rather busy lately.

Cheers,
Mikkel

_______________________________________________
xdg mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/xdg
