Re: custom stop word list for standard analyzer

2007-04-12 Thread Chris Hostetter
: Michael Barbarelli wrote: : > Can I instantiate a standard analyzer with an argument containing my own : > stop words? If so, how? Will they be appended to or override the built-in I'm really surprised how often this question gets asked ... Michael (or anyone else for that matter) do you have a

Re: custom stop word list for standard analyzer

2007-04-12 Thread Paul Cowan
Michael Barbarelli wrote: Can I instantiate a standard analyzer with an argument containing my own stop words? If so, how? Will they be appended to or override the built-in stop words? You can do it with one of the alternate constructors, and they'll override the built-in list. --- String

custom stop word list for standard analyzer

2007-04-12 Thread Michael Barbarelli
I know this is a relatively fundamental thing to arrange, but I'm having trouble. Can I instantiate a standard analyzer with an argument containing my own stop words? If so, how? Will they be appended to or override the built-in stop words? Or, do I have to modify the analyzer class itself and

Re: strange idf in Lucene 2.1

2007-04-12 Thread Yonik Seeley
On 4/12/07, Koji Sekiguchi <[EMAIL PROTECTED]> wrote: > Is the index completely removed between the 2.0 and 2.1 runs? Sure. If you see my program, you'll find I'm using RAMDirectory. OK, I think it's due to the change in merge policy. Lucene 2.0 could under-merge (not enough) or over-merge (b

Re: strange idf in Lucene 2.1

2007-04-12 Thread Koji Sekiguchi
> Is the index completely removed between the 2.0 and 2.1 runs? Sure. If you see my program, you'll find I'm using RAMDirectory. regards, Koji

Re: I have a question about phrase query with stop words

2007-04-12 Thread Erick Erickson
As I understand it, there really is no "space indicator". I think of it as replacing the stop word with a space, which is then discarded. So, you're indexing 'you find answer', and both your searches are looking for 'you find answer'; the stop words are just gone as though they never were. So bo
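The point about stop words disappearing can be sketched in plain Java. This is only an illustration, not Lucene code: the class name, the whitespace tokenizer, and the two-word stop list ("will" and "the", taken from the question in this thread) are assumptions standing in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordPhraseSketch {
    // Hypothetical stop list; a real analyzer ships a much longer one.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("will", "the"));

    // Lowercase, split on whitespace, and drop stop words without leaving a gap.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Both phrases reduce to the same token sequence.
        System.out.println(analyze("you will find the answer")); // prints [you, find, answer]
        System.out.println(analyze("you find answer"));          // prints [you, find, answer]
    }
}
```

Under these assumptions, an indexed phrase and a query phrase that differ only in stop words become indistinguishable once analyzed, which is the behaviour described above.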

I have a question about phrase query with stop words

2007-04-12 Thread Bill Taylor
I found some discussions of this question from back in 2003, but that was many updates ago. I have built an index using the standard stop analyser which uses the standard list of stop words. "will" and "the" are stop words. As I understand analyzers and phrase queries, when I search for you wi

Re: strange idf in Lucene 2.1

2007-04-12 Thread Yonik Seeley
On 4/12/07, Koji Sekiguchi <[EMAIL PROTECTED]> wrote: Chris, > i'm not understanding this part of the thread ... are you saying that if > you have two identical setups, the only difference being that one uses 2.0 > and the other uses 2.1, then you see different idfs after > adding/deleting/re-add

Re: strange idf in Lucene 2.1

2007-04-12 Thread Koji Sekiguchi
Chris, i'm not understanding this part of the thread ... are you saying that if you have two identical setups, the only difference being that one uses 2.0 and the other uses 2.1, then you see different idfs after adding/deleting/re-adding many docs? Exactly. Please try to run the program whic

Re: strange idf in Lucene 2.1

2007-04-12 Thread Chris Hostetter
: But if now the index goes through a massive update, where almost all the : docs containing TC are deleted, and TC is not in any newly added doc, : practically TC becomes rare too, and hence D2 should probably be scored : higher than D1. But IDF(TC) might not (yet) reflect the massive docs : dele

Re: strange idf in Lucene 2.1

2007-04-12 Thread Doron Cohen
Chris Hostetter <[EMAIL PROTECTED]> wrote on 12/04/2007 15:22:20: > > : But not which terms have an odd IDF value because of those deleted > : documents. How much does the IDF value contribute to the "score" in > : search? > > all idf's are affected equally, because the 'numDocs' value used is >

Re: strange idf in Lucene 2.1

2007-04-12 Thread Yonik Seeley
On 4/12/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: : But not which terms have an odd IDF value because of those deleted : documents. How much does the IDF value contribute to the "score" in : search? all idf's are affected equally, because the 'numDocs' value used is always the same The

Re: strange idf in Lucene 2.1

2007-04-12 Thread Chris Hostetter
: > This should be the same for Lucene 2.0 and 2.1. : : I understand. But I think we could well come across this issue : with Lucene 2.1 than 2.0? i'm not understanding this part of the thread ... are you saying that if you have two identical setups, the only difference being that one uses 2.0

Re: strange idf in Lucene 2.1

2007-04-12 Thread Chris Hostetter
: But not which terms have an odd IDF value because of those deleted : documents. How much does the IDF value contribute to the "score" in : search? all idf's are affected equally, because the 'numDocs' value used is always the same ... it really shouldn't affect the scores from a query, it jus

Re: Sorting on a field that can have null values

2007-04-12 Thread Chris Hostetter
: If i remember correctly (you'll have to test this) sorting on a field : which doesn't exist for every doc does what you would want (docs with : values are listed before docs without) : The actual behavior is different than described above. I modified : TestSort.java: : The actual order of the

Re: Index performance

2007-04-12 Thread Doron Cohen
To cover all possible non-indexing overhead, better measure with something like this: static long indexContents(IndexWriter writer, List storyContentList) throws IOException { long res = 0; if (storyContentList != null && storyContentList.size() != 0) { try {

Re: Index performance

2007-04-12 Thread Erick Erickson
Inferring out on the end of a long and fragile limb. Do you get information from the database in any of the calls in your indexing loop? That is, do any of... itr.next(); content.getStoryText(), content.getStoryIdentity() content.getHeadline1() go out to the DB to get info, and could

Re: Index performance

2007-04-12 Thread Doron Cohen
> I tried to index it. It took from 7-10 seconds to index about 90 documents. That would be around 10 documents per second - way too slow. A Lucene perf test adding 12,000 docs sized similar to your sample doc (1400 characters) on a not so strong machine shows a much faster pace - 146 docs per sec

Sorting on a field that can have null values

2007-04-12 Thread Peter Keegan
I'm copying this reply from a topic with the same title from the defunct 'lucene-user' list. My comments follow it. : I thought of putting empty strings instead of null values but I think : empty strings are put first in the list while sorting which is the : reverse of what anyone would want. in
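The ordering asked for here (entries with values first, entries without a value last) can be sketched in plain Java. This does not show how Lucene's sort actually behaves, which is the open question in this thread; the class and method names are hypothetical, and null stands in for a document that has no value in the sort field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class NullsLastSortSketch {
    // Returns a sorted copy: non-null values in natural order, nulls at the end.
    static List<String> sortValuesFirst(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.nullsLast(Comparator.<String>naturalOrder()));
        return sorted;
    }

    public static void main(String[] args) {
        // null represents a missing field value
        System.out.println(sortValuesFirst(Arrays.asList("b", null, "a")));
    }
}
```

An empty string, by contrast, sorts before every other string in natural order, which is why substituting empty strings for missing values puts those entries first.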

Re: Index performance

2007-04-12 Thread Tony Qian
Otis, I timed just for indexing. thanks, Tony From: Otis Gospodnetic <[EMAIL PROTECTED]> Reply-To: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: Re: Index performance Date: Thu, 12 Apr 2007 09:31:49 -0700 (PDT) Hi Tony, Your code looks fine to me. I'm not sure what you timed - the whole a

Re: Index performance

2007-04-12 Thread Tony Qian
Eric, Thanks for the information. The id is generated by database and it is unique. So I only need to index it and don't need to store it, right Tony From: "Erick Erickson" <[EMAIL PROTECTED]> Reply-To: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: Re: Index performance Date: Thu, 12 Apr

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread Steffen Heinrich
On 12 Apr 2007 at 20:22, karl wettin wrote: > > 12 apr 2007 kl. 20.00 skrev Steffen Heinrich: > > > This search is only meant to be used in an ajax-driven web > > application. > > And the basic idea is to give the user incentive and turn him to > > something new, something he didn't think of bef

Re: Term frequency

2007-04-12 Thread Doron Cohen
karl wettin <[EMAIL PROTECTED]> wrote on 12/04/2007 00:25:47: > > 12 apr 2007 kl. 09.12 skrev sai hariharan: > > > Thanx for replying. In my scenario i'm not going to index any of my > > docs. > > So is there a way to find out term frequencies of the terms in a doc > > without doing the indexing p

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread karl wettin
12 apr 2007 kl. 20.00 skrev Steffen Heinrich: This search is only meant to be used in an ajax-driven web application. And the basic idea is to give the user incentive and turn him to something new, something he didn't think of before. I just generalized on the concept in a mail to Erick under t

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread Steffen Heinrich
On 12 Apr 2007 at 16:49, karl wettin wrote: > > 12 apr 2007 kl. 12.19 skrev Steffen Heinrich: > > > > > The intended system however can not be trained by user input. The > > suggestions have to come from a given corpus (e.g. an occasionally > > updated product database). > > Do you think adopting

Re: strange idf in Lucene 2.1

2007-04-12 Thread Bill Janssen
> The difference between IndexReader.maxDoc() and numDocs() tells you > how many documents have been marked for deletion but still take up > space in the index. But not which terms have an odd IDF value because of those deleted documents. How much does the IDF value contribute to the "score" in s

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread Steffen Heinrich
On 12 Apr 2007 at 9:27, Erick Erickson wrote: > See below ... > Not quite. As I understand your problem, you want all the terms that > match (or at least a subset) for a field. For this, WildcardTermEnum > is really all you need. Think of it this way... > (Wildcard)TermEnum gives you a list of

Re: Basic Question in Lucene Indexing.

2007-04-12 Thread Erick Erickson
Don't do that. Why are you trying to open the index 700,000 times? During indexing or searching? In either case, there's no reason to. You should be able to open the index and keep it open as long as you want. I still don't understand why you can't index the records individually, but I'll assume

Re: Issue with : Searcher.search() returning Hits of same length for different searches

2007-04-12 Thread Lokeya
Thanks for your suggestion. I used Luke to debug and found the issue. I have one million records to index, each of which has "Title", "Description" and "Identifier". If I take each document and try to index these fields, my program is very slow. So I took 100,000 records and got the value of these

Basic Question in Lucene Indexing.

2007-04-12 Thread Lokeya
I have one million records to index, each of which has "Title", "Description" and "Identifier". If I take each document and try to index these fields, my program is very slow. So I took 100,000 records and got the value of these fields, add them to the addDocument() method. Then I use the Index wri

Re: Issue with search() Help Appreciated.

2007-04-12 Thread Lokeya
The issue is solved. Luke was very helpful in debugging; in fact, it helped to identify a very basic mistake we were making. Lokeya wrote: > > I solved the issue by using: > > 1.Same Analyser. > 2.Making indexing by tokenizing terms. > > Now issue with the following code is, I am facing issues

Re: Index performance

2007-04-12 Thread Erick Erickson
Another question is if I can delete document based on storyIdentity field (using IndexReader.deleteDocuments(term)). Since storyIdentity field is not indexed, is there any performance issue or I should index it too (and store it)? As to your very last question, No, there'll be no performance

Re: strange idf in Lucene 2.1

2007-04-12 Thread Yonik Seeley
On 4/12/07, Bill Janssen <[EMAIL PROTECTED]> wrote: > docfreqs (idfs) do not take into account deleted docs. > This is more of an engineering tradeoff rather than a feature. > If we could cheaply and easily update idfs when documents are deleted > from an index, we would. Wow. So is it fair to
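The effect discussed in this thread can be put into numbers with a small sketch. It assumes the classic formula idf = ln(numDocs / (docFreq + 1)) + 1, which is how Lucene's default similarity is usually described; the class name and the document counts are invented for illustration, and nothing here calls Lucene itself.

```java
public class IdfSketch {
    // Assumed classic idf formula: ln(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical index: 1000 documents, the term occurs in 500 of them.
        // Then 490 of those 500 are deleted but not yet merged away, so the
        // stored docFreq and the document count still include them.
        double stale = idf(500, 1000);

        // After a merge drops the deleted documents: 510 remain, 10 hold the term.
        double merged = idf(10, 510);

        System.out.println("idf with deleted docs still counted: " + stale);
        System.out.println("idf after deletions are merged away: " + merged);
    }
}
```

With these made-up counts the term looks common until a merge removes the deleted documents, at which point its idf rises. That is one way the same sequence of adds and deletes could produce different idf values under different merge behaviour.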

Re: Index performance

2007-04-12 Thread Otis Gospodnetic
Hi Tony, Your code looks fine to me. I'm not sure what you timed - the whole app run, just indexing, indexing + optimizing... If you timed indexing + optimizing, leave optimization out of the timer. How long do you think this should take? Try setting maxBufferedDocs to 90. Otis

Re: strange idf in Lucene 2.1

2007-04-12 Thread Bill Janssen
> docfreqs (idfs) do not take into account deleted docs. > This is more of an engineering tradeoff rather than a feature. > If we could cheaply and easily update idfs when documents are deleted > from an index, we would. Wow. So is it fair to say that the stored IDF is really the cumulative IDF f

filename search with lucene

2007-04-12 Thread Hari Krishna Bikmal
I am trying to make an FTP search engine for searching filenames (not the content). I am thinking of using Apache Commons Net for accessing FTP servers and want to implement the indexing and searching part using Lucene. Can anyone tell me how to use Lucene in this context?

Re: Are KeywordAnalyser and Field.Index.UN_TOKENIZED synonymous.

2007-04-12 Thread Otis Gospodnetic
I think you are right. But for sanity, if you really want the field to be untokenized, use F.I.UN_TOKENIZED. Otis - Original Message From: Paul Taylor <[EMAIL PROTEC

Index performance

2007-04-12 Thread Tony Qian
All, Sorry for long email. I have two questions on indexing. My data consists of an id, short headline and story text. Story text has some html tags. Here is an example. In early 2005, it seemed that Shamita Shetty had finally arrived after a high profile debut in Mohabbatein [2000]. With 3

RE: How to update index dynamically

2007-04-12 Thread Tony Qian
You have to refresh your IndexSearcher periodically. Tony From: anson <[EMAIL PROTECTED]> Reply-To: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: How to update index dynamically Date: Mon, 09 Apr 2007 18:25:57 +0900 I have built a blog project under Tomcat 5.5 with Lucene 2.0. And I want to

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread karl wettin
12 apr 2007 kl. 12.19 skrev Steffen Heinrich: The intended system however can not be trained by user input. The suggestions have to come from a given corpus (e.g. an occasionally updated product database). Do you think adopting your package to set up the tries from a corpus would be fairly easy

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread Erick Erickson
See below On 4/12/07, Steffen Heinrich <[EMAIL PROTECTED]> wrote: On 11 Apr 2007 at 18:05, Erick Erickson wrote: > Rather than using a search, have you thought about using a TermEnum? > It's much, much, much faster than a query. What it allows you to do > is enumerate the terms in the index

Are KeywordAnalyser and Field.Index.UN_TOKENIZED synonymous.

2007-04-12 Thread Paul Taylor
Are KeywordAnalyser and Field.Index.UN_TOKENIZED synonymous. i.e if I create an IndexWriter with a KeywordAnalyser does it make any difference whether I index my fields within documents added to this index with Field.Index.UN_TOKENIZED or Field.Index.TOKENIZED thanks paul --

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread Steffen Heinrich
On 12 Apr 2007 at 0:28, karl wettin wrote: > > 11 apr 2007 kl. 22.32 skrev Steffen Heinrich: > > > According to occasional references on this list some people have > > already tried to implement such a search with lucene but did they > > succeed? > > > > My first idea was to run every completed t

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread Steffen Heinrich
On 12 Apr 2007 at 7:13, Antony Bowesman wrote: > Steffen Heinrich wrote: > > Normally an IndexWriter uses only one default Analyzer for all its > > tokenizing businesses. And while it is apparently possible to supply > > a certain other instance when adding a specific document there seems > >

Re: Turning PrefixQuery into a TermQuery

2007-04-12 Thread Steffen Heinrich
On 11 Apr 2007 at 18:05, Erick Erickson wrote: > Rather than using a search, have you thought about using a TermEnum? > It's much, much, much faster than a query. What it allows you to do > is enumerate the terms in the index on a per-field basis. Essentially, this > is what happens when you do a P

Re: Term frequency

2007-04-12 Thread karl wettin
12 apr 2007 kl. 09.12 skrev sai hariharan: Thanx for replying. In my scenario i'm not going to index any of my docs. So is there a way to find out term frequencies of the terms in a doc without doing the indexing part? Using an analyzer (TokenStream) and a Map? while ((t = ts.next()) != null)
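The token-stream-plus-Map idea suggested here can be sketched without any index. The version below is a plain-Java illustration rather than the Lucene loop from the message: the class name is hypothetical, and a lowercase-and-split step is assumed as a stand-in for a real analyzer's token stream.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class TermFrequencySketch {
    // Stand-in for an analyzer: lowercase the text and split on
    // runs of characters that are not ASCII letters or digits.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            counts.merge(token, 1, Integer::sum); // add 1, or start at 1
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(termFrequencies("the cat and the hat"));
        // prints {the=2, cat=1, and=1, hat=1}
    }
}
```

To use a real analyzer instead, the loop body would stay the same and only the source of tokens would change.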

Re: Term frequency

2007-04-12 Thread sai hariharan
Hi, Thanx for replying. In my scenario i'm not going to index any of my docs. So is there a way to find out term frequencies of the terms in a doc without doing the indexing part? Thanx in advance, Hari On 4/12/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: Add Term Vectors to your Field durin