Martin,

We use interMedia Text to index and query up to about 10-15 million CLOB
documents (up to 5KB each).  We're on 8.1.6.0.0 under Win2k - 2 550MHz CPUs,
2GB RAM, 18 36GB drives.

Because a domain index cannot be partitioned, we have the documents spread
across 5 tables (on 6 drives).  One is a 2 partition table (each partition
on its own drive) containing the current two months of docs, the other 4
hold the 4 prior months' docs.  We can query the entire 6 months of docs via
a Union View on them - even Contains() queries work fine on this view.

When we add a new month's partition, the prior month's partition gets turned
into a table (segment exchange).  The interMedia Text indexes on the
partitioned table and the new prior month are rebuilt.

Lately we've been getting about 3.5 million docs/month and the index rebuild
takes about 7 hours - that's 7 hrs. for the index on the prior month and 7
more hours for the index on the partitioned table, which only contains one
month of docs at that point.

Since we're adding docs every day, we sync the interMedia index every
morning.  Last night we added about 200,000 docs and it took about 3 hours
for the index to resync.  We don't use ctxsrv, but use CTX_DDL.Sync_Index.

When we get over about 4.5 million docs in a table, the resync really slows
down.  The in-memory part still happens at about 150 docs/sec, but when
interMedia writes to disk it slows down a bunch.  What took 3 hours today
will take 10 hours in a couple of weeks.

That's why I plan on spreading the DR$<>$I segment across multiple drives by
spreading the datafiles of its tablespace across those drives.

BTW, that brings up some performance points - be sure you cache the DR$<>$R
segment (use CACHE not CACHE READS, due to bugs in Oracle):

  Alter Table DR$<YourIndexName>$R Modify LOB (Data) (Cache) ;

Also ensure that your LOBs are out-of-line and stored in their own
segment(s) on drive(s) separate from the "regular" data.  Make sure that
your I_TABLE_CLAUSE, R_TABLE_CLAUSE, and I_INDEX_CLAUSE all specify
tablespaces on their own drives to spread the I/O out even further.  We're
getting 2GB more RAM on a new server, so I plan on caching the 900MB DR$<>$X
segment, which is the index on the DR$<>$I token table.

I've learned a lot about how interMedia Text processes different kinds of
queries by watching disk I/O on Win2k's Performance Monitor while I issue
various "flavors".  Our folks use lots of complex query terms with heavy use
of the Stemmer.  I've gotten them to switch from using tons of ORs to using
the Equivalence operator and we're getting much better results using NEAR
than simple ANDs.  Performance is very good, with CONTAINS queries returning
results in less than a second for terms that are rare in the docs, up to a
minute for terms that are common in lots (e.g. hundreds of thousands) of
docs.

If you're going to do synonym searches, you'd better start looking for a
good thesaurus - the one Oracle ships is pretty limited.  We've not found a
good one for the technical lingo our docs contain, so we don't do ABOUT
queries at this time.

Get familiar with CTX_Query.Explain, it will help you understand things like
what the Stemmer *really* does and how complex queries are parsed.

Hope this helps.

Jack

--------------------------------
Jack C. Applewhite
Database Administrator/Developer
OCP Oracle8 DBA
iNetProfit, Inc.
Austin, Texas
www.iNetProfit.com
[EMAIL PROTECTED]
(512)327-9068


-----Original Message-----
Kendall
Sent: Thursday, October 04, 2001 10:00 AM
To: Multiple recipients of list ORACLE-L


Hello all,

Although I have installed Intermedia as part of my general DBA duties before
I have not experienced any particular requirements on throughput rate or
indexing.

I need some information on being able to deal with large volumes of product
data (e.g. 1 million products in a retail application) and be able to
perform 'intelligent' searches against the metadata (things like
typographical error matching, synonyms etc.) as well as the more usual
parametric search (i.e. advanced search page with lots of metadata specific
fields).

Indexing time and max throughput are also of interest.

Any data based on experience would be appreciated.

Thanks

Martin

-- 
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- 
Author: Jack C. Applewhite
  INET: [EMAIL PROTECTED]

Fat City Network Services    -- (858) 538-5051  FAX: (858) 538-5051
San Diego, California        -- Public Internet access / Mailing Lists
--------------------------------------------------------------------
To REMOVE yourself from this mailing list, send an E-Mail message
to: [EMAIL PROTECTED] (note EXACT spelling of 'ListGuru') and in
the message BODY, include a line containing: UNSUB ORACLE-L
(or the name of mailing list you want to be removed from).  You may
also send the HELP command for other information (like subscribing).

Reply via email to