If it's any help, we (when I was at Quest) were removing and adding items on an ongoing basis. We had indexes with 25+ million items, around 15GB+ from memory. We'd add around 100K items a day, and some items were added more than once (which means remove, then add). We got very good performance once we changed MaxDocuments (to 100K; it was maxint, about 2.1 billion) and the MergeFactor. Merging really large blocks (I don't know the official term - segments?) made performance lousy, but you don't NEED to keep them all in one file.
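To make the point about large merges concrete, here is a rough toy model in Python. It is not Lucene's actual merge policy: the `simulate` function, its `flush_size` parameter, and the rule "merge the newest `merge_factor` equal-sized segments unless the result would exceed `max_merge_docs`" are all simplifying assumptions made for illustration. The "MaxDocuments" setting mentioned above is presumably what Lucene calls maxMergeDocs, but that mapping is a guess.

```python
def simulate(num_docs, merge_factor, max_merge_docs, flush_size=1000):
    """Toy model of segment merging (NOT Lucene's real policy).

    Returns (segments, docs_copied, largest_merge), where segments is a list
    of per-segment document counts, docs_copied is the total number of
    documents rewritten by merges, and largest_merge is the size of the
    biggest single merge.
    """
    segments = []        # document count of each segment, oldest first
    docs_copied = 0
    largest_merge = 0

    for _ in range(num_docs // flush_size):
        # Each flush writes one small new segment.
        segments.append(flush_size)

        # Cascade: merge the newest merge_factor segments while they are all
        # the same size and the merged result stays within the cap.
        while len(segments) >= merge_factor:
            tail = segments[-merge_factor:]
            merged = sum(tail)
            if len(set(tail)) != 1 or merged > max_merge_docs:
                break
            del segments[-merge_factor:]
            segments.append(merged)
            docs_copied += merged
            largest_merge = max(largest_merge, merged)

    return segments, docs_copied, largest_merge


if __name__ == "__main__":
    # Hypothetical workload: one million documents, merge factor of 10.
    for cap in (2**31 - 1, 100_000):
        segs, copied, largest = simulate(1_000_000, 10, cap)
        print(f"cap={cap}: {len(segs)} segment(s), "
              f"{copied} docs rewritten, largest merge {largest} docs")
```

In this toy model the capped run never rewrites more than 100,000 documents in a single merge, at the cost of leaving several segments on disk instead of one. That is the trade-off described above: searches touch more files, but no single write has to copy most of the index.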
Does that make sense? How much stuff are you putting into the book index? 500meg sounds about right, but an hour sounds a little high?

-----Original Message-----
From: Gautam Lad [mailto:[EMAIL PROTECTED]
Sent: 12 February 2008 04:21
To: lucene-net-user@incubator.apache.org
Subject: RE: Website and keeping data fresh

Very good to know.

Since not a lot of documents update during the course of the day, and since we already rebuild the index at night, I doubt it would hurt performance, as you say :)

Thanks,
--
Gautam Lad

-----Original Message-----
From: Kurt Mackey [mailto:[EMAIL PROTECTED]
Sent: February 11, 2008 10:10 PM
To: lucene-net-user@incubator.apache.org
Subject: RE: Website and keeping data fresh

Nope. For that few writes, I can't see how you'd ever need to optimize during the day. You might run a few tests to find out how many writes cause search performance to degrade, but I suspect it's a lot. :)

Optimizing is slow because it essentially writes all the index contents to a new index file.

-Kurt

-----Original Message-----
From: Gautam Lad [mailto:[EMAIL PROTECTED]
Sent: Monday, February 11, 2008 8:40 PM
To: lucene-net-user@incubator.apache.org
Subject: Website and keeping data fresh

Hey all,

I recently moved our company's external website to use dotLucene, and so far it's been great and is working flawlessly.

I have several indices that I use to manage our website. Since our company is in the book industry, these indices serve various parts of the page.

E.g., our main catalog is searchable, so we have a "Book" index that can be searched by Title, Description, Author, etc. We also have an Author table that can be searched by First name, Last name, bio, etc. Finally, we have a BookAuthor relationship table: when a Book is searched, the BookAuthor index is searched to find out if the Book's authors have other books.
The indices are as follows:

Book (primary key: ISBN) - 160,000+ documents
Author (primary key: AuthorID) - 60,000+ documents
BookAuthor (contains LinkID) - 100,000+ documents

So far things are working great. The book index is about 500MB and is not a big overhead on our system.

Now here's where the problem lies. To keep things fresh on the site, we have a nightly job that rebuilds the entire index and then copies the data over to the production index folder (it takes about an hour to rebuild the entire site and a minute or two to copy things over).

However, there will be times when the information needs to be updated almost live during normal day-to-day hours. Say, for example, a book's description has changed. What I do is delete the document and then re-add it.

Unfortunately, deleting and re-adding it to the index takes a few minutes, and this is causing issues with information not being available when someone tries to look on the site.

Here's the log from our background service that rebuilds documents:

20080211 16:59:32 [Engine] [book] Deleting isbn(1554700310). Status: 1
20080211 16:59:32 [Engine] [book] [00:00:00:000] Getting table count
20080211 16:59:34 [Engine] [book] [00:00:02:156] Rows loaded 1
20080211 16:59:34 [Engine] [book] [00:00:02:156] Getting table schema
20080211 16:59:34 [Engine] [book] [00:00:02:218] Getting data reader
20080211 16:59:36 [Engine] [book] [16:59:36:000] Index dump started
20080211 16:59:36 [Engine] [book] [00:00:00:078] Total indexed: 1
20080211 16:59:36 [Engine] [book] [00:00:00:078] Optimizing index
20080211 17:02:23 [Engine] [book] [00:02:46:917] Index finished

You can see that from the moment it deleted the ISBN from the "book" index to when it finally added it back, it took only 4 seconds. But the call to Writer.Optimize() then takes about 2 minutes 47 seconds to optimize the index.

Is optimizing the index even necessary at this point?

Any help is greatly appreciated.
--
Gautam Lad