RE: Using more than one index

Mile Rosu Tue, 13 Jun 2006 06:17:30 -0700

Hi Hoss,

Thanks for your quick answer. One of the problems left with the date is
this:


A document (in our case an xml that has many metadata) can have more
than one date, each date with 2 attributes:

Eg:

<date type="document" art="geburt">00-00-1886</date> 

In the date index I have for every <date> in the input xml a document
with fields: type (document |other), date, art (birthday | deportation |
death...). For example if I merge all the dates that correspond to a
document then the new type field will contain all the values. So if I
want to search for a document that has type:document art:birthday and
date between a and b then I won't get the correct results.

Regards,
Mile Rosu

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 13, 2006 9:55 AM
To: [email protected]
Subject: Re: Using more than one index


A couple of suggestions...

1) don't use multiple indexes.  create one index, with one document per
"thing" you want to return (in this case it sounds like books) and index
all of the relevent data about each thing in that doc.  If multiple
people
worked on a book, add all of their names to the same field.  addd all of
the dates to the book doc -- if you need to distibguish the differnet
types of dates, make a seaprete field for each type.

If you *must* cross refrence...

2) make sure you aren't useing the Hits API to iterate over all the
results when gathering IDs -- use a lower level api (like a
HitCollector)

3) use the FieldCache to get the IDs instead of he stored Document
fields.

4) don't extract full ID lists from all of then indexes and then search
on one of the indexes again with the ID list ... use the ID lists
generated from the supporting indexes (people and dates) to build a
Filter
that you can use when searching the main index.



: Date: Mon, 12 Jun 2006 12:22:30 +0300
: From: Mile Rosu <[EMAIL PROTECTED]>
: Reply-To: [email protected]
: To: [email protected]
: Subject: Using more than one index
:
: Hello,
:
: We have an application dealing with historical books. The books have
: metadata consisting of event dates, and person names among others.
: The FullText, Person and Date indexes were split until we realized
that
: for a larger number of documents (400K) the combination of the
: sequential search hits took a way too long time to complete (15 min).
: The date index was built using the suggestion found at:
: http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing
(big
: thanks for the hint)
:
: Is there a recommended approach to combining results from different
: indexes (with different fields)?
:
: The indexes structure:
: MainIndex:
:       Fields:
:               @ID@ - keyword (document id)
:               @FULLTEXT@ - tokenized (used for full text6 search)
:               Ptitle - tokenized (used for full text publication title
: search)
:               Dtitle - tokenized (used for full text document title
: search)
:               Type - keyword - (used for document type)
:
: PersonIndex:
:               @ID@ - keyword (document id == [EMAIL PROTECTED]@)
:               Person - tokenized (full text person name search)
: DateIndex:
:               @ID@ - keyword (document id == [EMAIL PROTECTED]@)
:               Date - date as YYYYMMDD - keyword
:               Type - type of date (document date, birth day, etc...)
:               @YYYY@ - year of date
:               @YYYYMM@ - year and month of date
:               @DDD@ - decade
:               @CC@ - century of date
:
:
: Eg:
: If I want to search for documents that contain: person "John", full
text
: "book" and date: before 06/12/2005
: Step 1:  search in personIndex for John - retrieve all @ID@ from the
hit
: list
: Step 2: search in DateIndex for documents that have dates before
: 06/12/2005 - retrieve id from the hit list
: Step 3: search in mainIndex for "book" - retrieve all @ID@
: Step 4: combine all the lists
: Step 5: search mainIndex for documents with the @ID@ from the combined
: id list
:
: Each search takes less then 1 second, but retrieving @ID@ from the
index
: takes a lot more - the time increases by the number of hits. This is
: because when retrieving a field value from a document hit, the Lucene
: engine loads all the fields from the index (the entire document). So
if
: in one search I get 300.000 hits cont, I have to iterate through all
and
: retrieve the @ID@ field value - this takes a lot of time.
:
: Regards,
: Mile Rosu
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: [EMAIL PROTECTED]
: For additional commands, e-mail: [EMAIL PROTECTED]
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: Using more than one index

Reply via email to