Re: How can I tell Lucene to also use analyzer for Keyword fields

2006-06-13 Thread Chris Hostetter
: It seems analyzers never get called for UnTokenized fields (no luck either using PerFieldAnalyzer). The label "tokenized" is somewhat misleading: it assumes that your analyzer will do some tokenizing (which it doesn't have to do in the case of the KeywordAnalyzer). The best thing

Re: Getting count on distinct values of a field.

2006-06-13 Thread heritrix . lucene
But what if that word is also present in other fields? Does "docFreq" only look into that particular field? On 6/13/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: Look at the TermEnum class... iterate over the terms in your field, and docFreq is the number of docs with that term. : Date:

Index (speed) optimization

2006-06-13 Thread Trieschnigg, R.B. (Dolf)
Hi, I just looked at the log of my indexing program and saw that after adding 4.5 million documents (16 GB of text) to a newly created index, it took 7 hours (!) to carry out the optimization (indexWriter.optimize()). I am running the indexing program on a (3.2 GHz, 1 GB RAM) desktop computer wit

about PrefixQuery Matching

2006-06-13 Thread Flik Shen
While studying PrefixQuery, I found a problem. For example, take the search string: test(*) This matches testX, testX...X, but does not match test alone. Is this a real problem?

Re: question with spellchecker

2006-06-13 Thread mark harwood
For those with the luxury of a large store of historical queries it's interesting to note Google's approach to this. Not some fancy spell checker - just mining searcher behaviour patterns. Google's Bosworth describes this approach approx 13 minutes into this podcast: http://www.itconversations.c

RE: How can I tell Lucene to also use analyzer for Keyword fields

2006-06-13 Thread Ramana Jelda
Thanks for your replies. > -Original Message- > From: Chris Hostetter [mailto:[EMAIL PROTECTED] > Sent: Tuesday, June 13, 2006 9:13 AM > To: java-user@lucene.apache.org > Subject: Re: How can I tell Lucene to also use analyzer for Keyword fields > > > : It seems analyzers are never

RE: about PrefixQuery Matching

2006-06-13 Thread Mordo, Aviran (EXP N-NANNATEK)
The query should be test* (the brackets will be eliminated by the analyzer). Aviran http://www.aviransplace.com -Original Message- From: Flik Shen [mailto:[EMAIL PROTECTED] Sent: Tuesday, June 13, 2006 6:07 AM To: java-user@lucene.apache.org Subject: about PrefixQuery Matching When I s
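
A small plain-Java sketch of prefix semantics, not Lucene's PrefixQuery itself. It assumes a prefix query keeps every indexed term that starts with the prefix, which would include the bare term test. The class name PrefixMatch and the sample terms are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of prefix matching over a list of indexed terms (not the Lucene API).
class PrefixMatch {
    // Returns every term that starts with the given prefix, in input order.
    static List<String> matching(List<String> terms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

Under that assumption, a missing hit for test would more likely come from how the query string test(*) was parsed than from the prefix match itself.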

RE: Using more than one index

2006-06-13 Thread Mile Rosu
Hi Hoss, Thanks for your quick answer. One of the problems left with the date is this: A document (in our case an xml that has many metadata) can have more than one date, each date with 2 attributes: Eg: 00-00-1886 In the date index I have for every in the input xml a document with fields: t

How to use Query and TermQuery in a single file

2006-06-13 Thread Ramesh Salla
Hi, I am new to Lucene but feel quite comfortable using the API. I retrieve the Meta tags and the body from HTML files and their respective Title and Description from the database and then index documents. I use Query class to parse the search query. I get the results and I display the Title an

Re: Getting count on distinct values of a field.

2006-06-13 Thread Chris Hostetter
: But what if that word is present in other fields also. : does "docFreq " only look into that particular field ?? docFreq tells you the frequency of a term, and a term is a field plus a value -- if you want the counts of a value across multiple fields, you'll have to add them up yourself. (or make a
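
The "add them up yourself" step can be sketched in plain Java. This is a toy model rather than the Lucene API: the class FieldDocFreq, its methods, and the sample counts are all hypothetical, and the per-field counts stand in for what docFreq would report for each (field, value) term.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: document frequency is stored per (field, value) pair, mirroring
// the idea that a term is a field plus a value. Not the Lucene API.
class FieldDocFreq {
    // Key is "field:value"; value is the number of documents containing that term.
    private final Map<String, Integer> counts = new HashMap<>();

    void put(String field, String value, int docCount) {
        counts.put(field + ":" + value, docCount);
    }

    // Frequency for one field only; 0 if the term was never recorded.
    int docFreq(String field, String value) {
        return counts.getOrDefault(field + ":" + value, 0);
    }

    // Adds up the per-field frequencies of the same value.
    int sumAcrossFields(String value, String... fields) {
        int total = 0;
        for (String field : fields) {
            total += docFreq(field, value);
        }
        return total;
    }
}
```

Note that such a sum counts a document once per field in which the value occurs, so it can exceed the number of distinct documents.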

RE: Using more than one index

2006-06-13 Thread Chris Hostetter
: A document (in our case an xml that has many metadata) can have more : than one date, each date with 2 attributes: : 00-00-1886 : : In the date index I have for every in the input xml a document : with fields: type (document |other), date, art (birthday | deportation | : death...). For example

Re: Document design and analyzer questions?

2006-06-13 Thread Michael J. Prichard
Hey Chris, Thanks for the response. Chris Hostetter wrote: : Question is two fold. One, here is the layout I was thinking: my rule of thumb: if a field is going to contain less than a few dozen bytes (i.e. a date, an email address, etc.) you might as well store it ... it will make your life ea

Re: Document design and analyzer questions?

2006-06-13 Thread Chris Hostetter
: I will have millions of entries in my index. Would storing them cause : any performance issues? only testing will tell ... but generally speaking I don't think stored fields affect query performance very much -- just disk usage. : >another important thing you should consider is field norms: they don

JVM Crash

2006-06-13 Thread Ross Rankin
We keep getting JVM crashes on 1.4.3. I found in the archive that setting a JVM parameter solved the problem for a few users. We've tried that and it has not worked. Here are our JVM parameters: -Xms512m -Xmx1024m -XX:PermSize=256m We're running Tomcat 5.5.16. Any ideas? If it's an

Re: JVM Crash

2006-06-13 Thread N Hira
We had a similar problem. We discovered that eden/from was basically running out of memory, and we made two changes that seem to have helped: 1. Reduce [Max]PermSize to 128M 2. Use the concurrent garbage collector Good luck. -h --- Ross Rankin <[EMAIL PROTECTED]> wrote: > We keep gettin

Re: question with spellchecker

2006-06-13 Thread Bob Carpenter
Very nice idea. This is the basis of most of the work on word-sense-disambiguation (e.g. is it "run" as in baseball, "run" as in stock, or "run" as in stocking? or is "John Smith" CEO of GM or "John Smith" lover of Pocahontas?). TF/IDF's not a bad way to compute this, either, though there are d

Re: JVM Crash

2006-06-13 Thread Bob Carpenter
Java apps shouldn't throw these kinds of seg faults. Sounds like a problem with memory. Especially if you can't reproduce the error in the same location. Double especially if you have the same problems elsewhere under heavy memory load. I had all kinds of problems with seg faults in the JVM unt

Count occurrences of worths within a corpus.

2006-06-13 Thread Sergi Fernandez
Hi there, I'm new to Lucene, and so far I only know how to index a corpus and run a query. I thought I could count the number of times a word appears in the whole corpus with a simple query, but it seems not to be so easy. Does anybody know how to do it? Many thanks! Sergi Fernandez.

Lucene usage

2006-06-13 Thread Leandro Saad
Hi all. I'm writing a wrapper component around Lucene (using Avalon) and I'd like to know the common API usage. How should I bootstrap the index? Should I create the IndexSearcher when I initialize the component? How long should I keep the IndexWriter open? For one document: should I create

Re: Count occurrences of worths within a corpus.

2006-06-13 Thread Grant Ingersoll
Hi Sergi, Take a look at TermEnum and TermDocs in the API. You will have to iterate over these, summing as you go. You could also, during indexing, store these counts external to Lucene as you come across the term during the Analysis phase. Sergi Fernandez wrote: Hi there, I'm new in Luc
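
The "iterate and sum as you go" idea can be sketched in plain Java. This models the summing only and does not use TermEnum or TermDocs: the class CorpusTermCount and the pre-tokenized documents are hypothetical stand-ins for per-document term frequencies read from an index.

```java
import java.util.List;

// Toy model of counting a word across a corpus: compute the frequency inside
// each document, then sum over all documents. Not the Lucene API.
class CorpusTermCount {
    // Each String[] is one already-tokenized document.
    static long totalOccurrences(List<String[]> docs, String word) {
        long total = 0;
        for (String[] tokens : docs) {
            int freqInDoc = 0; // occurrences of the word in this document
            for (String token : tokens) {
                if (token.equals(word)) {
                    freqInDoc++;
                }
            }
            total += freqInDoc;
        }
        return total;
    }
}
```

The result is the total number of occurrences, which differs from a document frequency: a document containing the word three times contributes three, not one.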

Re: JVM Crash

2006-06-13 Thread Dan Armbrust
Ross Rankin wrote: We keep getting JVM crashes on 1.4.3. I found in the archive that setting a JVM parameter solved the problem for a few users. We've tried that and it has not worked. Here's our JVM parameters: Why not try a new JVM? Either a newer sun... or a JDK, or a blackdown... In o

Re: JVM Crash

2006-06-13 Thread kieran
It may well be to do with this Hotspot bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6407471 Note, the bug only appears when you invoke java with the "-server" command line option. Kieran Dan Armbrust wrote: Ross Rankin wrote: We keep getting JVM crashes on 1.4.3. I found in t

Detecting index existance

2006-06-13 Thread Eduardo S. Cordeiro
Hi there, I'm just starting up with Lucene after reading bits and pieces from Gospodnetic and Hatcher's "Lucene in Action" (and noticing the API has changed for 2.0.0). My question is this: is there a way to detect whether or not the index exists? I'm currently developing a web application that

Re: Detecting index existance

2006-06-13 Thread kent.fitch
Try IndexReader static method indexExists: http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#indexExists(java.lang.String) Kent Fitch

Re: Detecting index existance

2006-06-13 Thread Erick Erickson
Well, I just tried it (opening an IndexSearcher) and got this exception... java.io.FileNotFoundException: C:\blank\segments (The system cannot find the file specified) The directory c:\blank exists, but is empty. So, it seems you can just catch the exception and infer that your admin users aren'
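
A rough plain-Java sketch of an up-front check instead of catching the exception. It assumes, based only on the exception message quoted above, that an index directory of this Lucene version contains a file named segments; the class IndexPresence and its method are hypothetical, and other Lucene versions may name that file differently.

```java
import java.io.File;

// Heuristic check, not the Lucene API: treats a directory as holding an index
// when it contains a regular file called "segments" (an assumption taken from
// the FileNotFoundException message in this thread).
class IndexPresence {
    static boolean looksLikeIndex(File dir) {
        return new File(dir, "segments").isFile();
    }
}
```

An existing but empty directory, as in the C:\blank case, yields false here. Where the library offers its own existence check, that is the safer route than relying on a file name.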

Use one or more indexes?

2006-06-13 Thread Liao Xuefeng
Hi, I'm new to Lucene. I want to add full-text search to my website to search articles, images and BBS topics. I'm not sure whether to use a single index for all of these types or to create three indexes, one per type. If I use only one index, do I have to add a 'type' field to identify document

Re: Detecting index existance

2006-06-13 Thread Eduardo S. Cordeiro
Hi, Kent's suggestion worked (in fact, I had looked for such a method in other classes of the API -- forgot to look in IndexReader). It works just as expected :) Thanks again On 6/13/06, Erick Erickson <[EMAIL PROTECTED]> wrote: Well, I just tried it (opening an IndexSearcher) and got this ex

Re: Use one or more indexes?

2006-06-13 Thread wu fox
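Buddy: It depends on how you plan to organize your index. With multiple indexes you have to consider the problem of merging: for example, if you search both full text and titles, the search results of two indexes are involved, so what do you merge them by? Also, merging the results yourself is a foolish idea; you must let Lucene merge them for you, because of the speed of the algorithm. This is the main problem with multiple indexes: how to merge the results from each partition. With a single partition, of course you can put everything related into one document, and searching is no problem; the difficulty lies in "updating". Lucene has no update operation: it first deletes the doc and then adds it again. If the doc is fairly complex, you need to redo the indexing for that doc, and if full-text extraction is also involved, this process can take a very long time. For example, using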

Re: Getting count on distinct values of a field.

2006-06-13 Thread heritrix . lucene
I am sorry for my stupid question. Thanks. :-) Regards, On 6/13/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: : But what if that word is present in other fields also. : does "docFreq " only look into that particular field ?? docFreq tells you the frequency of a term, a term is a field a