Re: Almost parallel indexes

Nixon Fri, 28 Sep 2007 06:10:59 -0700

Guys

I encounter such condition of having two indexes
for one forum related application which had

Topics and its posts which were mapped in db in two different tables inrelational database.


But i had decided to have one index

like this

topicId , topicName , timestamp , postId , postedcomment , ISTOPICFLAG

ISTOPICFLAG is set to 1 for topic and 0 for posts duplicating the topicdata

which was need for UI display

I used query AND ISTOPICFLAG  =1 for only topic and  ISTOPICFLAG  =0
for posts

The application works fine , but frequent updates to index is slower it.

Regards

Nixon

----- Original Message -----From: "Erick Erickson" <[EMAIL PROTECTED]>

To: <java-user@lucene.apache.org>
Sent: Friday, September 28, 2007 5:43 AM
Subject: Re: Almost parallel indexes

OK, this isn't well thought out, more the first thing that
pops to mind...

You're right, Lucene doesn't do joins. But would it serve
to keep two indexes? One the slow-changing stuff
and one the fast-changing stuff. They are related by
some *external*  (as in "not the Lucene doc id)
field.

You'd have to custom roll something that searched
across both indexes. Depending on your search
semantics this may be hard or easy. It would be
easy if your search were simple, something like
(stuff in the fast-changing index) OR/AND/NOT
(stuff in the slow-changing index). Handling
(some in the fast) AND/OR/NOT (some in the
slow) would be much harder......

The idea is to collect a set of your external doc
IDs, then use TermEnum/TermDocs in each
index to get at the underlying document parts.

Lots depends upon how many hits you expect. 100 is
one thing, 1,000,000 is another.

This, of course, adds a layer of complexity that makes
things harder, but it might be worth a shot.

Disclaimer: I haven't personally done something like this,
so take it for what it's worth.

Best
Erick

On 9/27/07, Tim Sturge <[EMAIL PROTECTED]> wrote:


Hi,

I have an index which contains two very distinct types of fields:

- Some fields are large (many term documents) and change fairly slowly.
- Some fields are small (mostly titles, names, anchor text) and change
fairly rapidly.

Right now I keep around the large fields in raw form and when the small
fields change, I retokenize the large and the small fields together. The
problem is that this retokenization is sucking up most of my CPU time,

making the indexing process too slow (this index needs to track changesinalmost real time; I'm using one of the reopen() patches from LUCENE-743in

JIRA to achieve this).

I can't really use ParallelReader to keep the indexes the same; it
requires me to add documents to both indexes which means I have to
retokenize the large fields anyway. I would want to do a "join" on an
external id, and as far as I can tell, Lucene doesn't support that.

Alternatively, what I'd like is a way to either store a pre-tokenized

version of the large fields, or to be able to add fields to a documentthat

come from an existing document in the index.

I suspect there is more to this question than meets the eye, but I'd be
interested in any strategies that people have used in the past.

Thanks,

Tim

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

--




--


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Almost parallel indexes

Reply via email to