RE: Indexing multiple languages

2005-06-02 Thread Bob Cheung
Hi Erik, I am a newcomer to this list, so please allow me to ask a dumb question. For the StandardAnalyzer, will it have to be modified to accept different character encodings? We have customers in China, Taiwan and Hong Kong. Chinese data may come in three different encodings: Big5, GB and UTF-8.
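[One point worth sketching here: Lucene analyzers operate on Java Strings, which are already Unicode, so the Big5/GB/UTF-8 question is really about decoding bytes correctly *before* the text reaches the analyzer. A minimal illustration in plain Java (the charset names are standard JDK names; Big5/GBK availability depends on the JDK's extended charsets):]

```java
import java.nio.charset.Charset;

public class DecodeDemo {
    // Decode raw bytes into a Java String using the named charset.
    // StandardAnalyzer never sees the bytes -- only the decoded chars --
    // so this step, not the analyzer, is where Big5/GB/UTF-8 is handled.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";  // two Chinese characters
        byte[] utf8 = text.getBytes(Charset.forName("UTF-8"));
        System.out.println(decode(utf8, "UTF-8").equals(text));  // true
    }
}
```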

Re: Indexing and Hit Highlighting OCR Data

2005-06-02 Thread Chris Hostetter
This is a pretty interesting problem. I envy you. I would avoid the existing highlighter for your purposes -- highlighting in token space is a very different problem from "highlighting" in 2D space. Based on the XML sample you provided, it looks like your XML files are already a "tokenized" fo
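[The "2D space" point can be made concrete: once the matched OCR words are known, highlighting reduces to geometry on the page image rather than offsets in a token stream. A small sketch, assuming each matched word carries a bounding box taken from the OCR XML's position attributes (the class and method names here are hypothetical):]

```java
import java.awt.Rectangle;
import java.util.List;

public class OcrHighlight {
    // Union of the bounding boxes of the matched OCR words gives the
    // rectangle to paint over the scanned page image. In practice one
    // box per word (or per line) is usually painted, but the union
    // shows the coordinate arithmetic involved.
    static Rectangle highlightRegion(List<Rectangle> wordBoxes) {
        Rectangle region = null;
        for (Rectangle box : wordBoxes) {
            region = (region == null) ? new Rectangle(box) : region.union(box);
        }
        return region;
    }
}
```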

RE: how long should optimizing take

2005-06-02 Thread Angelov, Rossen
I'll make sure no indexing is started before the optimization is done. Most likely Sunday will be the optimization day for the indexes and every other night the documents will be added to the index. Only searching will be available through the web service while optimizing, but this should not be a

Re: how long should optimizing take

2005-06-02 Thread Dan Armbrust
You should be careful, however, not to end up with two VM instances each trying to open an index writer at the same time - one of them is going to fail. That is, if someone using your web interface tries to add a new document to the index while you have the optimizer running standalone, the web i
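[The failure mode described here comes from Lucene's write lock: the index directory can have only one writer at a time, and a second IndexWriter fails while the first holds the lock. The same single-writer behavior can be sketched with a plain exclusive file lock from the JDK (the lock file below is a stand-in, not Lucene's actual lock file):]

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

public class WriteLockDemo {
    // Returns true when a second attempt to take the exclusive lock
    // fails while the first holder still has it -- analogous to what a
    // second IndexWriter sees while the optimizer holds the write lock.
    static boolean secondAcquireFails(File lockFile) throws Exception {
        try (FileChannel first = new RandomAccessFile(lockFile, "rw").getChannel();
             FileChannel second = new RandomAccessFile(lockFile, "rw").getChannel()) {
            FileLock held = first.tryLock();      // the "standalone optimizer"
            if (held == null) return false;       // unexpected: could not lock at all
            try {
                return second.tryLock() == null;  // cross-process case: returns null
            } catch (OverlappingFileLockException e) {
                return true;                      // same-JVM case: throws instead
            } finally {
                held.release();
            }
        }
    }
}
```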

RE: Indexing multiple languages

2005-06-02 Thread Tansley, Robert
Thanks all for the useful comments. It seems that there are even more options --
4/ One index, with a separate Lucene document for each (item, language) combination, with one field that specifies the language
5/ One index, one Lucene document per item, with field names that include the language

RE: how long should optimizing take

2005-06-02 Thread Angelov, Rossen
Thanks for the suggestion, Jian Chen's idea is very similar too. Probably optimizing that often is not necessary and not that critical for speeding up the searches. I'll try changing the index process not to optimize at all and execute the optimization independently of the indexing on a weekly bas

RE: calculate wi = tfi * IDFi for each document.

2005-06-02 Thread Andrew Boyd
Ok. So if I get 10 Documents back from a search and I want to get the top 5 weighted terms for each of the 10 documents what API call should I use? I'm unable to find the connection between Similarity and a Document. I know I'm missing the elephant that must be in the middle of the room. Or

Indexing and Hit Highlighting OCR Data

2005-06-02 Thread Corey Keith
Hi, I am involved in a project which is trying to provide searching and hit highlighting on the scanned image of historical newspapers. We have an XML based OCR format. A sample is below. We need to index the CONTENT attribute of the String element which is the easy part. We would like to

Re: using the CachingWrapperFilter

2005-06-02 Thread Erik Hatcher
On Jun 2, 2005, at 11:56 AM, M. Mokotov wrote: Hi Erik, Thank you very much for your reply. The problem is I need only the caching without any date or query functionality. Since the only two constructable-without-wrapping Filters are the QueryFilter and DateFilter, I need either to have a dum

Re: how long should optimizing take

2005-06-02 Thread jian chen
Hi, optimize() merges the index segments into one single index segment. In your case, I guess the 2G index segment is quite large; if you merge it with any other small index segments, the merging process will definitely be slow. I think the performance should be ok without calling optimize(). Mor

Re: how long should optimizing take

2005-06-02 Thread Dan Armbrust
I would run your optimize process in a separate thread, so that your web client doesn't have to wait for it to return. You may even want to set the optimize part up to run on a weekly schedule, at a low load time. I probably wouldn't reoptimize after every 30 documents, on an index that size.
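[The "separate thread, weekly schedule, low load time" advice can be sketched with the JDK's scheduled executor. The Runnable would wrap the actual IndexWriter optimize call in a real setup; here it is left as a placeholder, and the delay/period are parameters so a weekly run is just one particular choice:]

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class OptimizeScheduler {
    // Run the (potentially long) optimize step off the request path so
    // web clients never block on it. For a weekly Sunday-night run:
    // schedule(pool, task, delayUntilSunday, 7, TimeUnit.DAYS).
    static ScheduledFuture<?> schedule(ScheduledExecutorService pool,
                                       Runnable optimizeTask,
                                       long initialDelay, long period,
                                       TimeUnit unit) {
        return pool.scheduleAtFixedRate(optimizeTask, initialDelay, period, unit);
    }
}
```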

RE: how long should optimizing take

2005-06-02 Thread Angelov, Rossen
I would like to bring that issue up again as I haven't resolved it yet and haven't found what's causing it. Any help, ideas or sharing experience are welcome! Thanks, Ross -Original Message- From: Angelov, Rossen Sent: Friday, May 27, 2005 10:42 AM To: 'java-user@lucene.apache.org' Subj

RE: using the CachingWrapperFilter

2005-06-02 Thread M. Mokotov
Hi Erik, Thank you very much for your reply. The problem is I need only the caching without any date or query functionality. Since the only two constructable-without-wrapping Filters are the QueryFilter and DateFilter, I need either to have a dummy filter, or to have a constructor for the Caching

Re: Adding to the termFreqVector

2005-06-02 Thread Grant Ingersoll
I don't think you need to add to a term vector. I guess what I was thinking (and this is just a guess not knowing your architecture) is that you could have a TokenStreamFilter/Analyzer that took in the appropriate term vectors, along with your runtime incremental information. Then, as you are "in

RE: Lucene and Documentum

2005-06-02 Thread Andrew Boyd
Hi Santanu, What type of API does Documentum provide? Does it expose the meta-data for all of its stored documents? Does it provide access to the actual documents? Thanks, Andrew -Original Message- From: Santanu Dutta <[EMAIL PROTECTED]> Sent: Jun 2, 2005 4:53 AM To: java-user@luc

RE: calculate wi = tfi * IDFi for each document.

2005-06-02 Thread Max Pfingsthorn
Hi, DefaultSimilarity uses exactly this weighting scheme. Makes sense since it's a pretty standard relevance measure... Bye! max -Original Message- From: Andrew Boyd [mailto:[EMAIL PROTECTED] Sent: Thursday, June 02, 2005 11:39 To: java-user@lucene.apache.org Subject: calculate wi = tfi

RE: Lucene and Documentum

2005-06-02 Thread Santanu Dutta
Hi Andrew, I have experience using lucene for a Content Management System (Content Repository). We are using different file types and different locales (fe, de, us, ru). Santanu -Original Message- From: Andrew Boyd [mailto:[EMAIL PROTECTED] Sent: Thursday, June 02, 2005 3:19 PM To: java-user@luce

Lucene and Documentum

2005-06-02 Thread Andrew Boyd
Hi All, Has anyone had any experience using lucene to search a Documentum repository? Thanks Andrew

calculate wi = tfi * IDFi for each document.

2005-06-02 Thread Andrew Boyd
If I have search results, how can I calculate wi = tfi * IDFi for each document, using Lucene's API?
wi = term weight
tfi = term frequency in a document
IDFi = inverse document frequency = log(D/dfi)
dfi = document frequency, i.e. number of documents containing term i
D = number of documents
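[The formula itself is a one-liner in plain Java, shown here with the natural log. Note this is the textbook form from the post, not exactly what Lucene computes: Lucene's DefaultSimilarity uses a related variant (roughly sqrt(tf) and 1 + log(D/(df+1)) for idf).]

```java
public class TermWeight {
    // wi = tfi * log(D / dfi): the weight of term i in one document,
    // where tf is the term's frequency in that document, docFreq the
    // number of documents containing the term, and numDocs the total
    // number of documents in the index.
    static double weight(int tf, int docFreq, int numDocs) {
        return tf * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // A term occurring 3 times, present in 10 of 100 documents:
        System.out.println(weight(3, 10, 100));  // 3 * ln(10)
    }
}
```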