RE: Using more than one index

Chris Hostetter Tue, 13 Jun 2006 10:45:29 -0700

: A document (in our case an xml that has many metadata) can have more
: than one date, each date with 2 attributes:


: <date type="document" art="geburt">00-00-1886</date>
:
: In the date index I have for every <date> in the input xml a document
: with fields: type (document |other), date, art (birthday | deportation |
: death...). For example if I merge all the dates that correspond to a
: document then the new type field will contain all the values. So if I

So make a separate field for each type of date. Lucene is very good at
supporting documents with heterogeneous sets of fields -- and it's even
better if you use OMIT_NORMS (which makes perfect sense for a date field
where norm values are meaningless anyway).
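
A minimal sketch of the field-per-date-type idea, in plain Java so it runs
without Lucene on the classpath. The Map only stands in for a Lucene
Document. The "date_" prefix, the helper names and the second date value are
made up for illustration, and reading the input as DD-MM-YYYY is an
assumption based on the example above. Converting to YYYYMMDD keeps the
values sortable as plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Lucene Document: field name -> values.
class DateFieldSketch {

    // Turn "DD-MM-YYYY" (unknown parts given as 00) into sortable "YYYYMMDD".
    static String toSortable(String ddMmYyyy) {
        String[] parts = ddMmYyyy.split("-");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected DD-MM-YYYY: " + ddMmYyyy);
        }
        return parts[2] + parts[1] + parts[0];
    }

    // One field per kind of date, e.g. "date_geburt" (the prefix is an assumption).
    static void addDate(Map<String, List<String>> doc, String art, String rawDate) {
        doc.computeIfAbsent("date_" + art, k -> new ArrayList<>())
           .add(toSortable(rawDate));
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        addDate(doc, "geburt", "00-00-1886");      // value from the example above
        addDate(doc, "deportation", "15-03-1942"); // made-up value
        System.out.println(doc);
    }
}
```

With this layout a query for birth dates only touches the date_geburt field,
so there is no need for a separate type field to tell the dates apart.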

: A couple of suggestions...
:
: 1) don't use multiple indexes.  create one index, with one document per
: "thing" you want to return (in this case it sounds like books) and index
: all of the relevant data about each thing in that doc.  If multiple
: people worked on a book, add all of their names to the same field.  add
: all of the dates to the book doc -- if you need to distinguish the
: different types of dates, make a separate field for each type.
:
: If you *must* cross reference...
:
: 2) make sure you aren't using the Hits API to iterate over all the
: results when gathering IDs -- use a lower level api (like a
: HitCollector)
:
: 3) use the FieldCache to get the IDs instead of the stored Document
: fields.
:
: 4) don't extract full ID lists from all of the indexes and then search
: on one of the indexes again with the ID list ... use the ID lists
: generated from the supporting indexes (people and dates) to build a
: Filter that you can use when searching the main index.
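
For 2) through 4), a rough sketch of the bitset idea in plain Java. It uses no
Lucene classes, so treat it as a model of the approach rather than the real
API. The mainIds array plays the role of what the FieldCache hands back (one
ID per doc number in the main index); the class name, method names and sample
IDs are all hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

class IdFilterSketch {

    // Mark the main-index doc numbers whose ID shows up in a supporting result.
    // IDs that are unknown to the main index are silently skipped.
    static BitSet toBits(String[] mainIds, Iterable<String> matchedIds) {
        Map<String, Integer> docNumById = new HashMap<>();
        for (int doc = 0; doc < mainIds.length; doc++) {
            docNumById.put(mainIds[doc], doc);
        }
        BitSet bits = new BitSet(mainIds.length);
        for (String id : matchedIds) {
            Integer doc = docNumById.get(id);
            if (doc != null) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // Combine two supporting results without modifying either input.
    static BitSet intersect(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    public static void main(String[] args) {
        String[] mainIds = {"b1", "b2", "b3", "b4"}; // ID per main-index doc number
        BitSet people = toBits(mainIds, Arrays.asList("b1", "b3", "b4"));
        BitSet dates = toBits(mainIds, Arrays.asList("b3", "b4", "b9"));
        BitSet allowed = intersect(people, dates);
        System.out.println(allowed); // doc numbers the main search may return
    }
}
```

The full text search then runs once against the main index, restricted to the
allowed doc numbers, instead of pulling stored @ID@ values for every hit and
searching again with the combined list.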
:
:
:
: : Date: Mon, 12 Jun 2006 12:22:30 +0300
: : From: Mile Rosu <[EMAIL PROTECTED]>
: : Reply-To: java-user@lucene.apache.org
: : To: java-user@lucene.apache.org
: : Subject: Using more than one index
: :
: : Hello,
: :
: : We have an application dealing with historical books. The books have
: : metadata consisting of event dates, and person names among others.
: : The FullText, Person and Date indexes were split until we realized
: : that for a larger number of documents (400K) the combination of the
: : sequential search hits took far too long to complete (15 min).
: : The date index was built using the suggestion found at:
: : http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing
: : (big thanks for the hint)
: :
: : Is there a recommended approach to combining results from different
: : indexes (with different fields)?
: :
: : The indexes structure:
: : MainIndex:
: :     Fields:
: :             @ID@ - keyword (document id)
: :             @FULLTEXT@ - tokenized (used for full text search)
: :             Ptitle - tokenized (used for full text publication title
: :                      search)
: :             Dtitle - tokenized (used for full text document title
: :                      search)
: :             Type - keyword - (used for document type)
: :
: : PersonIndex:
: :             @ID@ - keyword (document id == [EMAIL PROTECTED]@)
: :             Person - tokenized (full text person name search)
: : DateIndex:
: :             @ID@ - keyword (document id == [EMAIL PROTECTED]@)
: :             Date - date as YYYYMMDD - keyword
: :             Type - type of date (document date, birth day, etc...)
: :             @YYYY@ - year of date
: :             @YYYYMM@ - year and month of date
: :             @DDD@ - decade
: :             @CC@ - century of date
: :
: :
: : Eg:
: : If I want to search for documents that contain: person "John", full
: : text "book" and date: before 06/12/2005
: : Step 1: search in personIndex for John - retrieve all @ID@ from the
: : hit list
: : Step 2: search in DateIndex for documents that have dates before
: : 06/12/2005 - retrieve id from the hit list
: : Step 3: search in mainIndex for "book" - retrieve all @ID@
: : Step 4: combine all the lists
: : Step 5: search mainIndex for documents with the @ID@ from the combined
: : id list
: :
: : Each search takes less than 1 second, but retrieving @ID@ from the
: : index takes a lot more - the time increases with the number of hits.
: : This is because when retrieving a field value from a document hit, the
: : Lucene engine loads all the fields from the index (the entire
: : document). So if one search returns 300,000 hits, I have to iterate
: : through all of them and retrieve the @ID@ field value - this takes a
: : lot of time.
: :
: : Regards,
: : Mile Rosu
: :
: : ---------------------------------------------------------------------
: : To unsubscribe, e-mail: [EMAIL PROTECTED]
: : For additional commands, e-mail: [EMAIL PROTECTED]
: :
:
:
:
: -Hoss
:
:



-Hoss

