Hi Tony,
Your code looks fine to me.  I'm not sure what you timed - the whole app run, 
just indexing, or indexing + optimizing.  If you timed indexing + optimizing, 
leave optimization out of the timer.  How long do you think this should take?  
Try setting maxBufferedDocs to 90.
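
To see where the time goes, the two phases can be timed separately. Below is a rough sketch using only the JDK. PhaseTimer, timeMillis and busyWork are made-up names for illustration, and the two Runnables are placeholders for the real calls (indexContents(writer, storyContentList) and writer.optimize()), so treat it as a starting point rather than a drop-in fix.

```java
// Sketch: measure the indexing phase and the optimize phase separately.
// Nothing here touches Lucene; the Runnables are placeholders.
class PhaseTimer {

    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    /** Hypothetical stand-in workload so the sketch runs without Lucene. */
    static long busyWork(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += i % 7;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Placeholder for: indexContents(writer, storyContentList);
        Runnable indexing = () -> busyWork(200_000);
        // Placeholder for: writer.optimize();
        Runnable optimizing = () -> busyWork(400_000);

        long indexMs = timeMillis(indexing);
        long optimizeMs = timeMillis(optimizing);

        System.out.println("indexing: " + indexMs + " ms");
        System.out.println("optimize: " + optimizeMs + " ms");
    }
}
```

If most of the 7-10 seconds turns out to be in the optimize step, the per-document indexing cost is probably fine.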
 
Otis 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: Tony Qian <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Thursday, April 12, 2007 11:23:36 AM
Subject: Index performance

All,

Sorry for the long email. I have two questions about indexing. My data consists 
of an id, a short headline, and story text. The story text contains some HTML 
tags. Here is an example.

In early 2005, it seemed that Shamita Shetty had finally arrived after a 
high profile debut in <i>Mohabbatein</i> [2000]. <br /><br />With 3 of her 
films releasing in the first half of 2005, <i>Bewafa, Zeher </i>and 
<i>Fareb</i>, and the first two ending up making good money, it seemed that 
the gorgeous girl had finally started making her presence felt. <i>Zeher</i> 
helped her being recognized as an actor and her fans had all the reasons to 
believe that they would be seeing more of her in the coming months. <br 
/><br />Surprisingly there has been absolutely no movement ever since then 
from Shamita&#39;s end as she hasn&#39;t had a single release in almost 2 
years now. All of this would change though with the arrival of <i>Cash</i> 
where she is one of the leading ladies apart from Esha Deol and Dia Mirza. 
<br /><br />An action thriller popcorn entertainer, the film is directed by 
Anubahv Sinha of <i>Dus</i> fame and stars Ajay Devgan, Suneil Shetty, 
Ritesh Deshmukh and Zayed Khan in the lead.<br /><br />

I tried to index it. It took 7-10 seconds to index about 90 documents. 
Here is my code:

  static void indexContents(IndexWriter writer, List storyContentList)
    throws IOException {
    if (storyContentList != null && storyContentList.size() != 0) {
        try {
            Iterator itr = storyContentList.iterator();
            while (itr.hasNext()) {
                StoryContents content = (StoryContents) itr.next();
                Document document = new Document();
                // Story text: stored and tokenized for full-text search.
                document.add(new Field("storyText", content.getStoryText(),
                             Field.Store.YES, Field.Index.TOKENIZED));
                // Identity and headline: stored only, not indexed.
                document.add(new Field("storyIdentity",
                             String.valueOf(content.getStoryIdentity()),
                             Field.Store.YES, Field.Index.NO));
                document.add(new Field("headline1",
                             String.valueOf(content.getHeadline1()),
                             Field.Store.YES, Field.Index.NO));
                writer.addDocument(document);
            }
        } catch (Exception ex) {
            System.out.println(" caught a " + ex.getClass());
        }
    }
  }

I opened one IndexWriter at the very beginning:
IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);

I called optimize and closed the IndexWriter after indexing the documents:
writer.optimize();
writer.close();

My question is why it took so long. Do I need to follow the instructions under 
"How can I index HTML documents?" in the FAQ on the Lucene web site?
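
For illustration only, here is a minimal sketch of stripping markup from story text like the sample above before it is handed to the analyzer. It is a plain regex approach, not the parser-based method the FAQ describes, and it is not claimed to explain the timing. HtmlTextSketch and stripMarkup are hypothetical names, and only the single entity seen in the sample (&#39;) is decoded.

```java
// Sketch: crude regex-based markup removal, assuming simple inline tags
// such as <i> and <br /> as in the sample story. Not a real HTML parser.
class HtmlTextSketch {

    /** Returns the input with tags removed, &#39; decoded, and whitespace collapsed. */
    static String stripMarkup(String html) {
        // Replace each tag with a space so adjacent words stay separated.
        String text = html.replaceAll("<[^>]*>", " ");
        // Decode the one numeric entity that appears in the sample text.
        text = text.replace("&#39;", "'");
        // Collapse runs of whitespace (including line breaks) into one space.
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String sample = "debut in <i>Mohabbatein</i> [2000]. <br /><br />With Shamita&#39;s";
        System.out.println(stripMarkup(sample));
    }
}
```

The cleaned string would then be passed to the "storyText" field in place of content.getStoryText().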

Another question is whether I can delete documents based on the storyIdentity 
field (using IndexReader.deleteDocuments(term)). Since the storyIdentity field 
is not indexed, is there a performance issue, or should I index it too (and 
store it)?

Appreciate your help.

Tony



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




