Re: Term frequency

2007-04-12 Thread sai hariharan
Hi, Thanx for replying. In my scenario i'm not going to index any of my docs. So is there a way to find out term frequencies of the terms in a doc without doing the indexing part? Thanx in advance, Hari On 4/12/07, Grant Ingersoll [EMAIL PROTECTED] wrote: Add Term Vectors to your Field during

Re: Term frequency

2007-04-12 Thread karl wettin
12 apr 2007 kl. 09.12 skrev sai hariharan: Thanx for replying. In my scenario i'm not going to index any of my docs. So is there a way to find out term frequencies of the terms in a doc without doing the indexing part? Using an analyzer (TokenStream) and a Map<String, Integer>? while ((t =
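
The suggestion above (tokenize the text, then count tokens in a Map<String, Integer>) can be sketched without any index. This is only an illustration: the class name, the method name and the regex-based tokenizer are assumptions standing in for a real Lucene analyzer, whose tokenization rules differ.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Sketch only: a plain-Java stand-in for "analyzer + Map<String, Integer>".
// The regex split is an assumption, not Lucene's tokenization.
class TermFrequencySketch {

    // Counts how often each lower-cased token occurs in the given text.
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        // Split on runs of characters that are neither letters nor digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            freq.merge(token, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(countTerms("The quick fox saw the lazy fox."));
    }
}
```

With a real analyzer the loop body would stay the same; only the source of the tokens would change.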

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread Steffen Heinrich
On 11 Apr 2007 at 18:05, Erick Erickson wrote: Rather than using a search, have you thought about using a TermEnum? It's much, much, much faster than a query. What it allows you to do is enumerate the terms in the index on a per-field basis. Essentially, this is what happens when you do a

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread Steffen Heinrich
On 12 Apr 2007 at 7:13, Antony Bowesman wrote: Steffen Heinrich wrote: Normally an IndexWriter uses only one default Analyzer for all its tokenizing businesses. And while it is apparently possible to supply a certain other instance when adding a specific document there seems to be no

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread Steffen Heinrich
On 12 Apr 2007 at 0:28, karl wettin wrote: 11 apr 2007 kl. 22.32 skrev Steffen Heinrich: According to occasional references on this list some people have already tried to implement such a search with lucene but did they succeed? My first idea was to run every completed token of the

Are KeywordAnalyser and Field.Index.UN_TOKENIZED synonymous.

2007-04-12 Thread Paul Taylor
Are KeywordAnalyser and Field.Index.UN_TOKENIZED synonymous? I.e., if I create an IndexWriter with a KeywordAnalyser, does it make any difference whether I index my fields within documents added to this index with Field.Index.UN_TOKENIZED or Field.Index.TOKENIZED? Thanks, Paul

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread Erick Erickson
See below On 4/12/07, Steffen Heinrich [EMAIL PROTECTED] wrote: On 11 Apr 2007 at 18:05, Erick Erickson wrote: Rather than using a search, have you thought about using a TermEnum? It's much, much, much faster than a query. What it allows you to do is enumerate the terms in the index on

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread karl wettin
12 apr 2007 kl. 12.19 skrev Steffen Heinrich: The intended system however can not be trained by user input. The suggestions have to come from a given corpus (e.g. an occasionally updated product database). Do you think adapting your package to set up the tries from a corpus would be fairly

RE: How to update index dynamically

2007-04-12 Thread Tony Qian
You have to refresh your IndexSearcher periodically. Tony From: anson [EMAIL PROTECTED] Reply-To: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: How to update index dynamically Date: Mon, 09 Apr 2007 18:25:57 +0900 I have built a blog project under tomcat5.5 with Lucene2.0. And I want to

Index performance

2007-04-12 Thread Tony Qian
All, Sorry for long email. I have two questions on indexing. My data consists of an id, short headline and story text. Story text has some html tags. Here is an example. In early 2005, it seemed that Shamita Shetty had finally arrived after a high profile debut in <i>Mohabbatein</i> [2000]. <br>

Re: Are KeywordAnalyser and Field.Index.UN_TOKENIZED synonymous.

2007-04-12 Thread Otis Gospodnetic
I think you are right. But for sanity, if you really want the field to be untokenized, use F.I.UN_TOKENIZED. Otis

filename search with lucene

2007-04-12 Thread Hari Krishna Bikmal
I am trying to make an FTP search engine for searching filenames (not the content). I am thinking of using Apache Commons Net for accessing FTP servers and want to implement the indexing and searching part using Lucene. Can anyone tell me how to use Lucene in this context?

Re: strange idf in Lucene 2.1

2007-04-12 Thread Bill Janssen
docfreqs (idfs) do not take into account deleted docs. This is more of an engineering tradeoff rather than a feature. If we could cheaply and easily update idfs when documents are deleted from an index, we would. Wow. So is it fair to say that the stored IDF is really the cumulative IDF for

Re: Index performance

2007-04-12 Thread Otis Gospodnetic
Hi Tony, Your code looks fine to me. I'm not sure what you timed - the whole app run, just indexing, indexing + optimizing... If you timed indexing + optimizing, leave optimization out of the timer. How long do you think this should take? Try setting maxBufferedDocs to 90. Otis

Re: strange idf in Lucene 2.1

2007-04-12 Thread Yonik Seeley
On 4/12/07, Bill Janssen [EMAIL PROTECTED] wrote: docfreqs (idfs) do not take into account deleted docs. This is more of an engineering tradeoff rather than a feature. If we could cheaply and easily update idfs when documents are deleted from an index, we would. Wow. So is it fair to say

Re: Index performance

2007-04-12 Thread Erick Erickson
Another question is if I can delete document based on storyIdentity field ( using IndexReader.deleteDocuments(term)). Since storyIdentity field is not indexed, is there any performance issue or I should index it too (and store it)? As to your very last question, No, there'll be no

Re: Issue with search() Help Appreciated.

2007-04-12 Thread Lokeya
The issue is solved. Luke was very helpful in debugging, in fact it helped to identify a very basic mistake we were making. Lokeya wrote: I solved the issue by using: 1.Same Analyser. 2.Making indexing by tokenizing terms. Now issue with the following code is, I am facing issues which

Basic Question in Lucene Indexing.

2007-04-12 Thread Lokeya
I have one million records to index, each of which has Title, Description and Identifier. If I take each document and try to index these fields my program was very slow. So I took 100,000 records and get the value of these fields, add them to the addDocument() method. Then I use the Index writer

Re: Basic Question in Lucene Indexing.

2007-04-12 Thread Erick Erickson
Don't do that <G> Why are you trying to open the index 700,000 times? During indexing or searching? In either case, there's no reason to. You should be able to open the index and keep it open as long as you want. I still don't understand why you can't index the records individually, but I'll

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread Steffen Heinrich
On 12 Apr 2007 at 9:27, Erick Erickson wrote: See below ... Not quite. As I understand your problem, you want all the terms that match (or at least a subset) for a field. For this, WildcardTermEnum is really all you need. Think of it this way... (Wildcard)TermEnum gives you a list of all
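
The idea described above, walking a sorted list of terms instead of running a query, can be sketched in plain Java. The TreeSet here is a hypothetical stand-in for the index's sorted term dictionary, and the class and method names are made up for illustration; none of this is Lucene API. In Lucene 2.x the analogous first step would be seeking a TermEnum to the first term at or after the prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Sketch only: a TreeSet plays the role of the sorted term dictionary.
class PrefixTermsSketch {

    // Returns up to 'limit' terms that start with 'prefix', in sorted order.
    static List<String> termsWithPrefix(SortedSet<String> terms, String prefix, int limit) {
        List<String> matches = new ArrayList<>();
        // tailSet(prefix) seeks to the first term >= prefix, so no
        // earlier terms are visited.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix) || matches.size() >= limit) {
                break; // sorted order: once the prefix stops matching, we are done
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(
                Arrays.asList("query", "luke", "lucene", "queue", "lunch"));
        System.out.println(termsWithPrefix(terms, "lu", 10));
    }
}
```

Because the terms are sorted, the loop touches only the matching terms plus one, which is why an enumeration of this kind can be much cheaper than a search.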

Re: strange idf in Lucene 2.1

2007-04-12 Thread Bill Janssen
The difference between IndexReader.maxDoc() and numDocs() tells you how many documents have been marked for deletion but still take up space in the index. But not which terms have an odd IDF value because of those deleted documents. How much does the IDF value contribute to the score in
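
A small arithmetic sketch of the point being discussed. It assumes the default idf formula of Lucene 2.x as I recall it, log(numDocs / (docFreq + 1)) + 1, evaluated with maxDoc as the document count; all numbers and names below are hypothetical and chosen only to show the direction of the effect.

```java
// Sketch only: the formula is an assumption about the default similarity,
// and the document counts are invented for illustration.
class StaleIdfSketch {

    // idf = ln(numDocs / (docFreq + 1)) + 1
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Documents marked deleted but not yet merged away.
    static int deletedDocs(int maxDoc, int numDocs) {
        return maxDoc - numDocs;
    }

    public static void main(String[] args) {
        int maxDoc = 1000;  // includes deleted documents
        int numDocs = 400;  // live documents only
        System.out.println("deleted: " + deletedDocs(maxDoc, numDocs));

        // Before merging: docFreq still counts 600 deleted documents.
        System.out.println("stale idf: " + idf(900, maxDoc));
        // After merging: only the 300 live documents with the term remain.
        System.out.println("fresh idf: " + idf(300, numDocs));
    }
}
```

In this made-up case the stale idf is lower than the fresh one, because the term still looks more common than it is among live documents.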

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread karl wettin
12 apr 2007 kl. 20.00 skrev Steffen Heinrich: This search is only meant to be used in an ajax-driven web application. And the basic idea is to give the user incentive and turn him to something new, something he didn't think of before. I just generalized on the concept in a mail to Erick under

Re: Term frequency

2007-04-12 Thread Doron Cohen
karl wettin [EMAIL PROTECTED] wrote on 12/04/2007 00:25:47: 12 apr 2007 kl. 09.12 skrev sai hariharan: Thanx for replying. In my scenario i'm not going to index any of my docs. So is there a way to find out term frequencies of the terms in a doc without doing the indexing part? Using

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread Steffen Heinrich
On 12 Apr 2007 at 20:22, karl wettin wrote: 12 apr 2007 kl. 20.00 skrev Steffen Heinrich: This search is only meant to be used in an ajax-driven web application. And the basic idea is to give the user incentive and turn him to something new, something he didn't think of before. I

Re: Index performance

2007-04-12 Thread Doron Cohen
To cover all possible non-indexing overhead, better measure with something like this: static long indexContents(IndexWriter writer, List storyContentList) throws IOException { long res = 0; if (storyContentList != null && storyContentList.size() != 0) { try {

Re: strange idf in Lucene 2.1

2007-04-12 Thread Chris Hostetter
: But not which terms have an odd IDF value because of those deleted : documents. How much does the IDF value contribute to the score in : search? all idf's are affected equally, because the 'numDocs' value used is always the same ... it really shouldn't affect the scores from a query, it just

Re: strange idf in Lucene 2.1

2007-04-12 Thread Chris Hostetter
: This should be the same for Lucene 2.0 and 2.1. : : I understand. But I think we could well come across this issue : with Lucene 2.1 than 2.0? i'm not understanding this part of the thread ... are you saying that if you have two identical setups, the only difference being that one uses 2.0

Re: strange idf in Lucene 2.1

2007-04-12 Thread Yonik Seeley
On 4/12/07, Chris Hostetter [EMAIL PROTECTED] wrote: : But not which terms have an odd IDF value because of those deleted : documents. How much does the IDF value contribute to the score in : search? all idf's are affected equally, because the 'numDocs' value used is always the same There

Re: strange idf in Lucene 2.1

2007-04-12 Thread Doron Cohen
Chris Hostetter [EMAIL PROTECTED] wrote on 12/04/2007 15:22:20: : But not which terms have an odd IDF value because of those deleted : documents. How much does the IDF value contribute to the score in : search? all idf's are affected equally, because the 'numDocs' value used is always the

Re: strange idf in Lucene 2.1

2007-04-12 Thread Chris Hostetter
: But if now the index goes through a massive update, where almost all the : docs containing TC are deleted, and TC is not in any newly added doc, : practically TC becomes rare too, and hence D2 should probably be scored : higher than D1. But IDF(TC) might not (yet) reflect the massive docs :

Re: strange idf in Lucene 2.1

2007-04-12 Thread Koji Sekiguchi
Chris, i'm not understanding this part of the thread ... are you saying that if you have two identical setups, the only difference being that one uses 2.0 and the other uses 2.1, then you see different idfs after adding/deleting/re-adding many docs? Exactly. Please try to run the program

Re: strange idf in Lucene 2.1

2007-04-12 Thread Yonik Seeley
On 4/12/07, Koji Sekiguchi [EMAIL PROTECTED] wrote: Chris, i'm not understanding this part of the thread ... are you saying that if you have two identical setups, the only difference being that one uses 2.0 and the other uses 2.1, then you see different idfs after adding/deleting/re-adding

I have a question about phrase query with stop words

2007-04-12 Thread Bill Taylor
I found some discussions of this question from back in 2003, but that was many updates ago. I have built an index using the standard stop analyser which uses the standard list of stop words. "will" and "the" are stop words. As I understand analyzers and phrase queries, when I search for you will

Re: strange idf in Lucene 2.1

2007-04-12 Thread Koji Sekiguchi
Is the index completely removed between the 2.0 and 2.1 runs? Sure. If you see my program, you'll find I'm using RAMDirectory. regards, Koji

Re: strange idf in Lucene 2.1

2007-04-12 Thread Yonik Seeley
On 4/12/07, Koji Sekiguchi [EMAIL PROTECTED] wrote: Is the index completely removed between the 2.0 and 2.1 runs? Sure. If you see my program, you'll find I'm using RAMDirectory. OK, I think it's due to the change in merge policy. Lucene 2.0 could under-merge (not enough) or over-merge

custom stop word list for standard analyzer

2007-04-12 Thread Michael Barbarelli
I know this is a relatively fundamental thing to arrange, but I'm having trouble. Can I instantiate a standard analyzer with an argument containing my own stop words? If so, how? Will they be appended to or override the built-in stop words? Or, do I have to modify the analyzer class itself

Re: custom stop word list for standard analyzer

2007-04-12 Thread Paul Cowan
Michael Barbarelli wrote: Can I instantiate a standard analyzer with an argument containing my own stop words? If so, how? Will they be appended to or override the built-in stop words? You can do it with one of the alternate constructors, and they'll override the built-in list. ---
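
The difference between overriding and appending can be shown with a plain-Java sketch. Everything here is hypothetical: the BUILT_IN set is an invented stand-in for an analyzer's default stop list, and the whitespace tokenizer is not what StandardAnalyzer does. It only illustrates that a custom list which replaces the defaults lets the old stop words through unless you merge the two lists yourself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: BUILT_IN is a made-up default list, not Lucene's.
class StopWordSketch {

    static final Set<String> BUILT_IN =
            new HashSet<>(Arrays.asList("a", "and", "the", "will"));

    // Lower-cases, splits on whitespace, and drops tokens in stopWords.
    static List<String> filter(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // "Append" behaviour: the union of the built-in and custom lists.
    static Set<String> appendToBuiltIn(Set<String> custom) {
        Set<String> merged = new HashSet<>(BUILT_IN);
        merged.addAll(custom);
        return merged;
    }

    public static void main(String[] args) {
        Set<String> custom = new HashSet<>(Arrays.asList("lucene"));
        // Override: only the custom list applies, so "the" and "and" survive.
        System.out.println(filter("The fox and Lucene", custom));
        // Append: both lists apply.
        System.out.println(filter("The fox and Lucene", appendToBuiltIn(custom)));
    }
}
```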

Re: custom stop word list for standard analyzer

2007-04-12 Thread Chris Hostetter
: Michael Barbarelli wrote: : Can I instantiate a standard analyzer with an argument containing my own : stop words? If so, how? Will they be appended to or override the built-in I'm really surprised how often this question gets asked ... Michael (or anyone else for that matter) do you have