Lucene 1.3 final to 1.4final problem

2004-07-07 Thread Karthik N S
Hey Dev Guys Apologies Can somebody explain to me why an input word "TA" given to StopAnalyzer.java returns [ta] instead of [TA], and "$125.95" ==> [125.95] instead of [$125.95]? Is it something wrong I have been missing? With r

Lucene 1.3 final to 1.4final problem

2004-07-07 Thread Karthik N S
Hey Dev Guys Apologies I have a Quick Problem... The number of hits on a set of documents indexed using 1.3-final is not the same on the 1.4-final version [ The only modification done to the src is that I have upgraded my CustomAnalyzer on the basis of the StopAnalyzer available in 1.4 ] Does doing this

Re: unicode-compatible

2004-07-07 Thread shafipour elnaz
I want to search Farsi pages, so I need a way to index Farsi pages. Someone told me to change the tokenizer. I did this but it doesn't work. Of course, if one already exists I'd prefer to use that, but I haven't found any yet. Otis Gospodnetic <[EMAIL PROTECTED]> wrote: Moving to lucene-user list. Persian = Farsi? Wha

RE: Search & Hit Score

2004-07-07 Thread Karthik N S
Hey Dev Guys Apologies Can somebody explain to me how to retrieve all hits available per indexed document. To explain in detail: a physical search of a single document would list 3 places for a certain word occurrence, so if I am supposed to retrieve all 3 occurrence
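
For reference, one way to get at every occurrence of a term within each matching document is IndexReader.termPositions() in the 1.4 API; a minimal sketch (the index path and field name below are made up):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermPositions;

    IndexReader reader = IndexReader.open("/path/to/index");
    TermPositions tp = reader.termPositions(new Term("contents", "word"));
    while (tp.next()) {                       // one iteration per matching document
        int freq = tp.freq();                 // occurrences of "word" in this doc
        for (int i = 0; i < freq; i++) {
            int position = tp.nextPosition(); // token position of each occurrence
        }
    }
    tp.close();
    reader.close();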

Re: unicode-compatible

2004-07-07 Thread Otis Gospodnetic
Moving to lucene-user list. Persian = Farsi? What you would need is a Farsi Analyzer, and Lucene does not come with one, unfortunately. You'll likely have to write it yourself, or find an existing one. Otis --- shafipour elnaz <[EMAIL PROTECTED]> wrote: > I want to make it to be compatible wit

Re: Lucene shouldn't use java.io.tmpdir

2004-07-07 Thread Otis Gospodnetic
Hey Kevin, Not sure if you're aware of it, but you can specify the lock dir, so in your example both JVMs could use the exact same lock dir, as long as you invoke the VMs with the same params. You shouldn't be writing to the same index with more than 1 IndexWriter, though (not sure if this was just
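
If memory serves, the lock directory in 1.4 is picked up from a system property read by FSDirectory, so both VMs can be pointed at the same place before any index is opened (verify the property name against your FSDirectory source):

    // set this before opening any FSDirectory-based index
    System.setProperty("org.apache.lucene.lockDir", "/var/lucene/locks");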

Lucene shouldn't use java.io.tmpdir

2004-07-07 Thread Kevin A. Burton
As of 1.3 (or was it 1.4?) Lucene migrated to using java.io.tmpdir to store the locks for the index. While under most situations this is safe, a lot of application servers change java.io.tmpdir at runtime. Tomcat is a good example. Within Tomcat this property is set to TOMCAT_HOME/temp. Und

Re: upgrade from Lucene 1.3 final to 1.4rc3 problem

2004-07-07 Thread Alex Aw Seat Kiong
Hi! Thanks, the problem was solved by using Lucene 1.4 final. Regards, AlexAw - Original Message - From: "Zilverline info" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Wednesday, July 07, 2004 10:32 PM Subject: Re: upgrade from Lucene 1.3 final to 1.4rc3 problem

Re: Deleting a Doc found via a Query

2004-07-07 Thread Bill Tschumy
Thanks. This works fine. I guess I was missing something. I would have expected this to be a property of Document. On Jul 7, 2004, at 8:49 PM, Peter M Cipollone wrote: Bill, Check http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Hits.html#id(int) Pete - Original Mess

Re: Deleting a Doc found via a Query

2004-07-07 Thread Peter M Cipollone
Bill, Check http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Hits.html#id(int) Pete - Original Message - From: "Bill Tschumy" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Wednesday, July 07, 2004 9:46 PM Subject: Deleting a Doc found via a Quer
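
For reference, a minimal sketch of the pattern Pete is pointing at, using the 1.4 API (the index path and the query object are made up); the ids are collected first so the deletions don't disturb the Hits iteration:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;

    IndexReader reader = IndexReader.open("/path/to/index");
    IndexSearcher searcher = new IndexSearcher(reader);
    Hits hits = searcher.search(query);
    int[] docNums = new int[hits.length()];
    for (int i = 0; i < hits.length(); i++)
        docNums[i] = hits.id(i);       // the internal document number
    for (int i = 0; i < docNums.length; i++)
        reader.delete(docNums[i]);     // mark each hit deleted
    reader.close();                    // commits the deletions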

Deleting a Doc found via a Query

2004-07-07 Thread Bill Tschumy
I must be missing something here, but I can't see an easy way to delete a Document that has been found via searching. The delete() method of IndexReader takes a docNum. How do I get the docNum corresponding to the Document in the Hits? I tried scanning through all the Documents using IndexRea

Re: PhraseQuery with Wildcards?

2004-07-07 Thread Erik Hatcher
On Jul 7, 2004, at 6:24 PM, [EMAIL PROTECTED] wrote: Hi, Is there any way to do a PhraseQuery with Wildcards? No. This very question came up a few days ago. Look at PhrasePrefixQuery - although it will take a bit of effort to expand the terms matching the wildcarded term. I'd like to search for
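
A hedged sketch of what that expansion might look like for MyField:"foo bar*" (the field name comes from the original question; the enumeration loop and variable names are my own):

    import java.util.ArrayList;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.search.PhrasePrefixQuery;

    PhrasePrefixQuery query = new PhrasePrefixQuery();
    query.add(new Term("MyField", "foo"));        // the fixed first term

    // expand bar* by walking the term dictionary from "bar" onward
    ArrayList expanded = new ArrayList();
    TermEnum terms = reader.terms(new Term("MyField", "bar"));
    try {
        do {
            Term t = terms.term();
            if (t == null || !t.field().equals("MyField")
                          || !t.text().startsWith("bar"))
                break;
            expanded.add(t);
        } while (terms.next());
    } finally {
        terms.close();
    }
    // any of the expanded terms may match in the second slot
    query.add((Term[]) expanded.toArray(new Term[0]));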

Re: indexing help

2004-07-07 Thread John Wang
Hi Doug: Thanks for the response! The solution you proposed is still a derivative of creating a dummy document stream. Taking the same example, java (5), lucene (6), VectorTokenStream would create a total of 11 Tokens whereas only 2 are necessary. Given many documents with many term
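
For readers following along, a rough sketch of the dummy-stream idea being discussed — a TokenStream that replays each term as many times as its pre-computed frequency (the class and field names here are illustrative, not John's actual code):

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class VectorTokenStream extends TokenStream {
        private String[] terms;   // e.g. { "java", "lucene" }
        private int[] freqs;      // e.g. { 5, 6 } => 11 Tokens total
        private int term = 0;
        private int emitted = 0;

        public VectorTokenStream(String[] terms, int[] freqs) {
            this.terms = terms;
            this.freqs = freqs;
        }

        public Token next() {
            // advance to the next term once its frequency is exhausted
            while (term < terms.length && emitted >= freqs[term]) {
                term++;
                emitted = 0;
            }
            if (term >= terms.length)
                return null;
            emitted++;
            // offsets are dummies; positions don't matter for this use case
            return new Token(terms[term], 0, 0);
        }

        public void close() {}
    }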

PhraseQuery with Wildcards?

2004-07-07 Thread yahootintin . 1247688
Hi, Is there any way to do a PhraseQuery with Wildcards? I'd like to search for: MyField:"foo bar*" I thought I could cobble something together using PhraseQuery and Wildcards but I couldn't get this functionality to work due to my lack of experience with Lucene. Is there a way to do

Re: indexing help

2004-07-07 Thread Doug Cutting
John Wang wrote: While lucene tokenizes the words in the document, it counts the frequency and figures out the position, we are trying to bypass this stage: For each document, I have a set of words with a known frequency, e.g. java (5), lucene (6) etc. (I don't care about the position, so it ca

RE: Problem with match on a non tokenized field.

2004-07-07 Thread wallen
Use org.apache.lucene.analysis.PerFieldAnalyzerWrapper. Here is how I use it:

    PerFieldAnalyzerWrapper analyzer =
        new org.apache.lucene.analysis.PerFieldAnalyzerWrapper(new MyAnalyzer());
    analyzer.addAnalyzer("url", new NullAnalyzer());
    try

Problem with match on a non tokenized field.

2004-07-07 Thread Polina Litvak
I have a Lucene Document with a field named Code which is stored and indexed but not tokenized. The value of the field is ABC5-LB. The only way I can match the field when searching is by entering Code:"ABC5-LB", because when I drop the quotes, every Analyzer I've tried using breaks my query into C
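
Another route that comes up for untokenized fields: since the whole value is indexed as the single term ABC5-LB, a TermQuery built by hand (bypassing QueryParser and its Analyzer entirely) matches it exactly:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    Query codeQuery = new TermQuery(new Term("Code", "ABC5-LB"));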

Re: Searching for asterisk in a term

2004-07-07 Thread Erik Hatcher
On Jul 7, 2004, at 3:41 PM, [EMAIL PROTECTED] wrote: Can you recommend an analyzer that doesn't discard '*' or '/'? WhitespaceAnalyzer :) Check the wiki AnalysisParalysis page also. Erik

Re: Searching for asterisk in a term

2004-07-07 Thread yahootintin . 1247688
Can you recommend an analyzer that doesn't discard '*' or '/'? --- "Lucene Users List" <[EMAIL PROTECTED]> wrote: The first thing you'll want to check is that you are using an Analyzer > that does not discard that '*' before indexing. StandardAnalyzer, for > instance, will discard it. Check o

Re: Searching for asterisk in a term

2004-07-07 Thread Otis Gospodnetic
The first thing you'll want to check is that you are using an Analyzer that does not discard that '*' before indexing. StandardAnalyzer, for instance, will discard it. Check one of Erik Hatcher's articles; they include a tool that helps you see what your Analyzer does with any given text inpu
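
A minimal stand-in for the kind of tool Otis mentions — it just prints what an Analyzer produces for a given string (the class name is my own):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;

    public class AnalyzerDemo {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new WhitespaceAnalyzer(); // keeps '*' and '/'
            TokenStream stream = analyzer.tokenStream("field",
                new StringReader("Hello *foo bar"));
            for (Token t = stream.next(); t != null; t = stream.next())
                System.out.println("[" + t.termText() + "]");
        }
    }

Swap in StandardAnalyzer and the '*' disappears from the output, which is exactly the behavior described above.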

indexing help

2004-07-07 Thread John Wang
Hi gurus: I am trying to be able to control the indexing process. While lucene tokenizes the words in the document, it counts the frequency and figures out the position, we are trying to bypass this stage: For each document, I have a set of words with a known frequency, e.g. java (5), l

Searching for asterisk in a term

2004-07-07 Thread yahootintin . 1247688
Hi, I'm trying to search for a term that contains an asterisk. This is the field that I indexed: - new Field("testField", "Hello *foo bar", true, true, true); I'm trying to find this document by matching '*foo': - new TermQuery(new Term("testField", "*me")); I've also tried to escap

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Kevin A. Burton
Doug Cutting wrote: Julien, Thanks for the excellent explanation. I think this thread points to a documentation problem. We should improve the javadoc for these parameters to make it easier for folks to ... In particular, the javadoc for mergeFactor should mention that very large values (>100) are n

Re: Most efficient way to index 14M documents (out of memory/file

2004-07-07 Thread markharw00d
Would it make more sense to use a parameter defining RAM size for the cache rather than minMergeDocs? Tuning RAM usage is the real issue here, and controlling it by guessing the number of docs you can squeeze into RAM is not the most helpful approach. How about a "setMaxCacheSize(int megabytes

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Doug Cutting
Julien, Thanks for the excellent explanation. I think this thread points to a documentation problem. We should improve the javadoc for these parameters to make it easier for folks to ... In particular, the javadoc for mergeFactor should mention that very large values (>100) are not recommended, sin

addIndexes and optimize

2004-07-07 Thread roy-lucene-user
Hey y'all again, Just wondering why the IndexWriter.addIndexes method calls optimize before and after it starts merging segments together. We would like to create an addIndexes method that doesn't optimize, and call optimize on the IndexWriter later. Roy.

Re: addIndexes vs addDocument

2004-07-07 Thread roy-lucene-user
Otis, Okay, got it... however we weren't creating new document objects... just grabbing a document through an IndexReader and calling addDocument on another index. Would that still work with unstored fields (well, it's working for us since we don't have any unstored fields)? Thanks a lot! Roy. O
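
For anyone else copying documents this way, the round trip only preserves stored fields, since IndexReader.document(i) has nothing else to return; a hedged sketch (the paths and analyzer choice are made up):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    IndexReader reader = IndexReader.open("/path/to/source");
    IndexWriter writer = new IndexWriter("/path/to/dest",
        new StandardAnalyzer(), true);
    for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i))    // skip docs marked deleted
            continue;
        // only stored fields survive; unstored fields are simply absent
        writer.addDocument(reader.document(i));
    }
    writer.close();
    reader.close();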

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Julien Nioche
It is not surprising that you run out of file handles with such a large mergeFactor. Before trying more complex strategies involving RAMDirectories and/or splitting your indexing across several machines, I reckon you should try simple things like using a low mergeFactor (e.g. 10) combined with a high

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Doug Cutting
A mergeFactor of 5000 is a bad idea. If you want to index faster, try increasing minMergeDocs instead. If you have lots of memory this can probably be 5000 or higher. Also, why do you optimize before you're done? That only slows things down. Perhaps you have to do it because you've set mergeFacto
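
Putting Doug's advice into code — in 1.4 these knobs are public fields on IndexWriter (the path is made up; the values are just the ones suggested in this thread):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    IndexWriter writer = new IndexWriter("/path/to/index",
        new StandardAnalyzer(), true);
    writer.mergeFactor = 10;     // modest, keeps open file handles bounded
    writer.minMergeDocs = 5000;  // buffer this many docs in RAM before hitting disk
    // ... add all documents ...
    writer.optimize();           // once, at the very end
    writer.close();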

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Harald Kirsch
On Tue, Jul 06, 2004 at 10:44:40PM -0700, Kevin A. Burton wrote: > I'm trying to burn an index of 14M documents. > > I have two problems. > > 1. I have to run optimize() every 50k documents or I run out of file > handles. this takes TIME and of course is linear to the size of the > index so i

RE: Search & Hit Score

2004-07-07 Thread Karthik N S
Hey Ype Apologies. I would be more interested in the boost/weight factor in terms of the Query rather than Fields. Please explain with example src. With regards Karthik -Original Message- From: Ype Kingma [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 07, 2004 12:08 PM To: [EMAI
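
For what Karthik is asking, boosting at query time rather than per field might look like this with the 1.4 BooleanQuery API (the field and terms are made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    TermQuery important = new TermQuery(new Term("contents", "lucene"));
    important.setBoost(4.0f);           // matches on this clause score 4x
    TermQuery ordinary = new TermQuery(new Term("contents", "index"));

    BooleanQuery query = new BooleanQuery();
    query.add(important, false, false); // optional clause (required=false, prohibited=false)
    query.add(ordinary, false, false);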

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Andrzej Bialecki
[EMAIL PROTECTED] wrote: A colleague of mine found the fastest way to index was to use a RAMDirectory, letting it grow to a pre-defined maximum size, then merging it to a new temporary file-based index to flush it. Repeat this, creating new directories for all the file based indexes then perform a

Re: upgrade from Lucene 1.3 final to 1.4rc3 problem

2004-07-07 Thread Zilverline info
This is a bug (see posting 'Lockfile Problem Solved'); upgrade to 1.4-final and you'll be fine. Alex Aw Seat Kiong wrote: Hi! I'm using Lucene 1.3 final currently, and all things were working fine. But after I upgraded from Lucene 1.3 final to 1.4rc3 (simply overwriting the lucene-1.4-final.jar to

Re: boolean operators and score

2004-07-07 Thread Don Vaillancourt
I think that the only way to resolve this would be to order your keywords alphabetically, so you control the result every single time, prior to submitting your search to Lucene. I don't know if Lucene does this, but I'm fairly sure that sorting the criteria would be a complex matter. At 09:05 AM 07/

MultiSearcher is very slow

2004-07-07 Thread Don Vaillancourt
Hi all, I've managed to add multi-index searching capability to my code. But one thing that I have noticed is that Lucene is extremely slow in searching. For example I have been testing with 2 indexes for the past month or so and searching them returns results in under 250ms and sometimes even

boolean operators and score

2004-07-07 Thread Niraj Alok
Hi Guys, Finally I have sorted out the problem of hit scores, thanks to the great help of Franck. I have hit another problem with the boolean operators now. When I search for "Winston and churchill" I get a set of perfectly acceptable results. But when I change the order to "churchill and winston", the r

Re: Optimizing for long queries?

2004-07-07 Thread Drew Farris
On Mon, 28 Jun 2004 10:04:40 +0200, Julien Nioche <[EMAIL PROTECTED]> wrote: > Hello Drew, > > I don't think it's in the FAQ. > Julien, Thanks for the advice, and the in-depth exploration of INDEX_INTERVAL here and on the developer's list. If I have the opportunity to run similar benchmarks com

Re: upgrade from Lucene 1.3 final to 1.4rc3 problem

2004-07-07 Thread Maxim Patramanskij
Hello Alex. I had a similar problem when I upgraded to Lucene 1.4 rc3 from 1.3 final. After a short investigation, I realized that the problem is in the code of the FSDirectory() constructor below: private FSDirectory(File path, boolean create) throws IOException { directory = path; lockDir = new F

Re: MultifieldQueryParser.parse()

2004-07-07 Thread Kelvin Tan
Hi Sergiu, First of all, if your application is web-based, it's not necessary to programmatically construct the query based on user input (via MultiFieldQueryParser). You can use luceneQueryConstructor.js in the Lucene sandbox. You can find the documentation here: http://cvs.apache.org/viewcvs.cgi/*che

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread markharw00d
A colleague of mine found the fastest way to index was to use a RAMDirectory, letting it grow to a pre-defined maximum size, then merging it to a new temporary file-based index to flush it. Repeat this, creating new directories for all the file-based indexes, then perform a merge into one index o
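
A hedged sketch of the batching scheme described here, against the 1.4 API (the paths, the batch-size test, and the analyzer are placeholders):

    import java.util.ArrayList;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    StandardAnalyzer analyzer = new StandardAnalyzer();
    ArrayList fsDirs = new ArrayList();

    // per batch: fill a RAMDirectory, then flush it to its own disk index
    RAMDirectory ram = new RAMDirectory();
    IndexWriter ramWriter = new IndexWriter(ram, analyzer, true);
    // ... addDocument() until the batch reaches your size budget ...
    ramWriter.close();

    Directory batchDir = FSDirectory.getDirectory("/tmp/batch-1", true);
    IndexWriter batchWriter = new IndexWriter(batchDir, analyzer, true);
    batchWriter.addIndexes(new Directory[] { ram }); // flush the RAM batch to disk
    batchWriter.close();
    fsDirs.add(batchDir);

    // after the last batch, merge every temporary index into the final one
    IndexWriter finalWriter = new IndexWriter("/path/to/final-index", analyzer, true);
    finalWriter.addIndexes((Directory[]) fsDirs.toArray(new Directory[0]));
    finalWriter.optimize();
    finalWriter.close();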