Oliver,

On Friday 27 August 2004 22:20, you wrote:
> Hi,
>
> I guess this one of the most often asked question on this mailing list, but
> hopefully my question is more specific, so that I can get some input from
> you.
>
> My project is to implement an agency system for newspapers. So I have to
> handle about 30 days of text and IPTC data. The later is taken from images
> provided by the agencies. I basically get a constant stream of text
> messages from the agencies (roughly 2000 per day per agency) and images
> (roughly 1000 per day per agency). I have to deal with 4 text and 6 image
> agencies. So my daily input is 8000 text messages and 6000 images. The
> extracted documents from these text messages and images have a size of
> about 1kb.
>
> The extraction of the data and converting them to Document objects is
> already finished and the search using lucence works like a charm. Brilliant
> software!
>
> But now to my questions. In order to understand what I am doing, like to
> talk a little about the kind of queries and data I have to deal with.
>
> *     Every message has a priority. An integer value ranging from 1 to 6.
> *     Every message has a receive date.
> *     Every message has an agency assigned, basically a unique string
> identifier for it.
> *     Every message has some header data, that is also indexed for refined
> searches.
> *     And of course the actual text included in the text message itself or
> the IPTC header of an image.
>
> Typically I have to kinds of queries.
>
> *     Typical relational queries
>
> *     Show every text messages from a certain agency in the last X days.

Probably good for a date filter, see the wiki on RangeQuery, and evt. my
previous message on filters (using 2nd index on constraining). Lucene has
no facilities for primary keys, so that is up to you.

> *     Show every image or text message with a higher priority then Y and
> from a certain period of time.

RangeQuery again for the priority.
One can store images in Lucene, but currently only in String format, ie.
they'll need some conversion. There was some talk on binary
objects (not too) recently, but that is still in development. I'd probably store the
images in a file system or in another db for now. OTOH, if you're willing
to help storing binary images lucene-dev is nearby.

> *     Fulltext search

Yes :)

> *     A real fulltext search over all elements using the full power of
> lucences query language.

Span queries are currently not supported by the query language,
you might have a look at the org.apache.lucene.search.spans package.

> It is absolutely no question anymore, that the later queries will be done
> using Lucene. But can the first type of query is the thing I am thinking
> about. Can this be done effeciently with Lucene? So far we use a system

Lucene can be as fast as relational databases, provided your lower level
java code on IndexReader plays nice with system resources like disk heads
and RAM.
That means using filters, sorting on index order before using an index
and evt. sorting on document number before retrieving stored fields.
Lucene's IndexSearcher for searching text queries is quite well behaved
in that respect. 

> that uses a SQL database engine for storing the relevant data and is used
> in these queries. But if Lucene is fast enough with these queries too, I am
> willing to skip the SQL database at all. But I have to remind, that I will
> be indexing about 400.000 messages per month.

To easily keep the primary keys in sync between the SQL db and Lucene,
I'd start by keeping the images and the full text only in the SQL db.
Lucene optimisations (needed after adding/deleting docs) copy all data
so it pays to keep the Lucene indexes small.

Later you might need multiple indexes, MultiSearcher, and occasionally
a merge of the indexes.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to