Problem with Date Range search

2003-11-13 Thread Joseph Wilkicki
Hi all! I'm having a problem with searching dates. I created two documents with the same date, 08/27/2002, in a lastModified field and then tried to search the range lastModified:[20020827 TO 20020827] (other, wider ranges don't seem to help). My understanding is that this should return my two doc
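A range such as [20020827 TO 20020827] can only match if the indexed terms are themselves comparable strings in the same form. The sketch below is plain Java, not Lucene code, and rests on an assumption the post does not confirm: that the field value was meant to be stored as a yyyyMMdd string. The class and method names (DateRangeSketch, toSortable, inRange) are hypothetical. It shows that a yyyyMMdd term falls inside the inclusive range, while the raw 08/27/2002 form does not.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Lucene: illustrates inclusive,
// lexicographic range matching over date strings.
class DateRangeSketch {
    static final DateTimeFormatter US = DateTimeFormatter.ofPattern("MM/dd/yyyy");
    static final DateTimeFormatter SORTABLE = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Convert a date written as MM/dd/yyyy into a sortable yyyyMMdd term.
    static String toSortable(String usDate) {
        return LocalDate.parse(usDate, US).format(SORTABLE);
    }

    // Inclusive range check using plain string comparison,
    // which is how a term range over strings behaves.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        String raw = "08/27/2002";
        String term = toSortable(raw);
        System.out.println(term + " matches: " + inRange(term, "20020827", "20020827"));
        System.out.println(raw + " matches: " + inRange(raw, "20020827", "20020827"));
    }
}
```

If the field was indexed in some other encoding, the same mismatch would apply: the query terms and the indexed terms have to use one format.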

RE: Two possible solutions on Parallel Searching

2003-11-13 Thread Tomcat Programmer
I did a quick search and it looks like you can pick up the Java JVM from IBM at no cost. They say it passes Sun's compatibility tests. They have versions 1.3.1 and 1.4.1 on the site: http://www-106.ibm.com/developerworks/java/jdk/ -Tom --- Tomcat Programmer <[EMAIL PROTECTED]> wrote: > > I s

RE: Two possible solutions on Parallel Searching

2003-11-13 Thread Tomcat Programmer
I saw an article from IBM somewhere, talking about how you go about giving options to the JVM to use all the non-reserved memory segments (on AIX which has segmented memory) and this would allow more than a 2GB heap. The point of that statement is that it sounds like IBM's JVM can do it. I'm not

Re: QueryParser Rules article (Erik Hatcher)

2003-11-13 Thread Tomcat Programmer
Hi Erik, Thanks for the replies, and your consideration on this problem. In my case, I use the non-static method because I want to set some properties (most importantly the default operator to AND) for the query parser. Looking at the code snippet provided, I guess the only thing the query parser

Re: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread Jie Yang
Are there ways to build transient indexes in memory in less than 1 second from the first query results? --- petite_abeille <[EMAIL PROTECTED]> wrote: > > On Nov 13, 2003, at 22:32, Jie Yang wrote: > > > I am trying to optimise the 500 OR > > terms so that it does not do a full 2 million > docs

Re: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread petite_abeille
On Nov 13, 2003, at 22:32, Jie Yang wrote: I am trying to optimise the 500 OR terms so that it does not do a full 2 million docs search but on the 1000 returned. Would it be beneficial to move the first result set into its own (transient) index to perform the second part of your query? PA.

RE: Two possible solutions on Parallel Searching

2003-11-13 Thread Chong, Herb
the one by NaturalBridge might, but it is not cheap. Herb... -----Original Message----- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Thursday, November 13, 2003 4:57 PM To: Lucene Users List Subject: Re: Two possible solutions on Parallel Searching I don't know of a Java implementation wh

Re: Two possible solutions on Parallel Searching

2003-11-13 Thread Doug Cutting
Jie Yang wrote: In this case, probably using a single RAMDirectory would allow me to run parallel searching without worrying about disk access. Well, has anyone tried to have a RAMDirectory of 5G in size? I don't know of a Java implementation which lets you have a heap larger than 2GB. In my experience,

Re: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread Doug Cutting
Jie Yang wrote: --- Erik Hatcher <[EMAIL PROTECTED]> wrote: Well, not quite, User normally enters a search string A that normally returns 1000 out of 2 millions docs. I then append A with 500 OR conditions... A AND (B or C or ... or x500). Are you adding the same 500 terms to each query? Or even

Re: Two possible solutions on Parallel Searching

2003-11-13 Thread Jie Yang
In this case, probably using a single RAMDirectory would allow me to run parallel searching without worrying about disk access. Well, has anyone tried to have a RAMDirectory of 5G in size? --- Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > Multiple threads against the same index or multiple > indices - n

RE: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread Chong, Herb
you're doing TREC-style query expansion using automatic relevance feedback? Herb -----Original Message----- From: Jie Yang [mailto:[EMAIL PROTECTED] Sent: Thursday, November 13, 2003 4:33 PM To: Lucene Users List Subject: Re: Query Filters on term A in query "A AND (B OR C OR D)" Well, not

Re: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread Jie Yang
--- Erik Hatcher <[EMAIL PROTECTED]> wrote: > Are we talking about that query being entered by the > user and you > handing it just like that to QueryParser? If so, > then QueryFilter > won't help. Well, not quite, User normally enters a search string A that normally returns 1000 out of 2 mill

RE: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread Ryan Clifton
How challenging would it be to add something to QueryParser to allow you to specify that you want to use filters? I have a similar case where I do a search for term1 AND term2 AND links:http???www?url?com?dir* If Lucene would use order of operations or in some way do the first two searches fi

Re: Two possible solutions on Parallel Searching

2003-11-13 Thread Otis Gospodnetic
Multiple threads against the same index or multiple indices - no advantage - think about the mechanical parts involved (disk head). Multiple threads against indices on different disks (not just partitions!) - yes, that would be faster. Reading the index from the disk is the bottleneck, not the CPU

Re: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread Erik Hatcher
On Thursday, November 13, 2003, at 04:07 PM, Jie Yang wrote: Erik, Just to make sure I understand you right, In an example query: ZipCode:CA10927 AND Gender:Male Are we talking about that query being entered by the user and you handing it just like that to QueryParser? If so, then QueryFilter

Re: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread Jie Yang
--- Erik Hatcher <[EMAIL PROTECTED]> wrote: > On Thursday, November 13, 2003, at 03:28 PM, Dan > Quaroni wrote: > > To my knowledge the answer is No, lucene performs > each query > > separately and > > then performs the joins after it has all the > results. This is > > actually a > > rather se

RE: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread Dan Quaroni
I guess I was wrong, then... But I have 262 indexes with a combined 130 or so million documents and at times the memory usage for a single query exceeds 1.3 gigs with me only taking the top 25 of the hits. We pushed the jvm to 1.6 gigs and it seems to be doing OK, but if it's not the results from

Re: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread Doug Cutting
Dan Quaroni wrote: name:Bob's Discount Furniture AND state:California AND city:San Diego Now, that query is going to retrieve EVERY Bob's discount furniture, EVERY company in California, and EVERY city in San Diego and then join them. That makes the memory requirements for this query far higher t

RE: Two possible solutions on Parallel Searching

2003-11-13 Thread Chong, Herb
multiple threads on a single disk will likely result in significantly slower searching, possibly an order of magnitude or more slowdown depending on many factors such as available RAM, etc. Herb... -----Original Message----- From: Jie Yang [mailto:[EMAIL PROTECTED] Sent: Thursday, November 13,

Re: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread Erik Hatcher
On Thursday, November 13, 2003, at 03:28 PM, Dan Quaroni wrote: To my knowledge the answer is No, lucene performs each query separately and then performs the joins after it has all the results. This is actually a rather serious problem when it comes to searches in large indexes where a single

RE: Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread Dan Quaroni
To my knowledge the answer is No, lucene performs each query separately and then performs the joins after it has all the results. This is actually a rather serious problem when it comes to searches in large indexes where a single field is very important but has a very low uniqueness. For example,

Query Filters on term A in query "A AND (B OR C OR D)"

2003-11-13 Thread Jie Yang
Can anyone clarify a bit more on the issue below? I can't seem to find any hints on this list. Many thanks. > > Again, I still feel a bit curious and want to find > > out whether Lucene does (or will in the future) pre-filter > > on "AND join conditions". For example, A AND (B OR > > C OR D). if
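The pre-filtering idea in this question can be illustrated outside Lucene. The sketch below is an assumption-laden toy, not how Lucene evaluates boolean queries: it uses a hypothetical in-memory postings map (term to document ids) and java.util.BitSet, first marking the documents that match A and then keeping only those matches of B, C or D that fall inside that set. The names PreFilterSketch, bitsFor and filterThenOr are invented for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of "A AND (B OR C OR D)" with A applied as a pre-filter.
class PreFilterSketch {
    // Build a bit set with one bit per document that contains the term.
    static BitSet bitsFor(Map<String, int[]> postings, String term, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : postings.getOrDefault(term, new int[0])) {
            bits.set(doc);
        }
        return bits;
    }

    // Keep only OR-term matches that are also in the set selected by term a.
    static BitSet filterThenOr(Map<String, int[]> postings, String a,
                               List<String> orTerms, int maxDoc) {
        BitSet allowed = bitsFor(postings, a, maxDoc);
        BitSet result = new BitSet(maxDoc);
        for (String term : orTerms) {
            for (int doc : postings.getOrDefault(term, new int[0])) {
                if (allowed.get(doc)) {
                    result.set(doc);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new HashMap<>();
        postings.put("A", new int[] {0, 2, 5});
        postings.put("B", new int[] {1, 2});
        postings.put("C", new int[] {5, 7});
        postings.put("D", new int[] {3});
        BitSet hits = filterThenOr(postings, "A", Arrays.asList("B", "C", "D"), 8);
        System.out.println(hits); // prints {2, 5}
    }
}
```

Note that this toy still walks the full postings of every OR term; the saving is only that nothing outside the A set is ever collected or scored.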

Re: Two possible solutions on Parallel Searching

2003-11-13 Thread Jie Yang
--- Doug Cutting <[EMAIL PROTECTED]> wrote: > First, note that the approaches you describe will > only improve > performance if you have multiple CPUs and/or > multiple disks holding the > indexes. > > Second, MultiSearcher is currently implemented to > search indexes > serially, not each in a

Re: Objection to using /tmp for lock files.

2003-11-13 Thread Dror Matalon
On Thu, Nov 13, 2003 at 10:18:39AM -0800, Doug Cutting wrote: > Dror Matalon wrote: > >Is there a reason why RODirectory shouldn't just be rolled into lucene? > > > >http://www.csita.unige.it/software/free/lucene/ > > This just looks like a version of FSDirectory with lock files disabled. > I th

Re: Objection to using /tmp for lock files.

2003-11-13 Thread Doug Cutting
Dror Matalon wrote: Is there a reason why RODirectory shouldn't just be rolled into lucene? http://www.csita.unige.it/software/free/lucene/ This just looks like a version of FSDirectory with lock files disabled. I think it would be better to just make it easier to disable lock files. Currently

Re: Two possible solutions on Parallel Searching

2003-11-13 Thread Doug Cutting
William W wrote: If I have two indexes and use the MultiSearcher will it be faster than only one index with all the documents ? No, in fact it would be slower. However it could be faster if (a) someone contributes a parallel version of MultiSearcher and (b) you're either running on a multiple-

Re: Objection to using /tmp for lock files.

2003-11-13 Thread petite_abeille
On Nov 13, 2003, at 19:00, Dror Matalon wrote: I've been experimenting with it and it seems to work as advertised. It has the advantage of not requiring *any* write capability in /tmp or anywhere else. There is a system property to turn off the lock files altogether. PA.

Re: Objection to using /tmp for lock files.

2003-11-13 Thread Dror Matalon
Is there a reason why RODirectory shouldn't just be rolled into lucene? http://www.csita.unige.it/software/free/lucene/ I've been experimenting with it and it seems to work as advertised. It has the advantage of not requiring *any* write capability in /tmp or anywhere else. Regards, Dror On T

Re: Two possible solutions on Parallel Searching

2003-11-13 Thread William W
Hi Folks, If I have two indexes and use the MultiSearcher will it be faster than only one index with all the documents ? Thanks, William. From: Doug Cutting <[EMAIL PROTECTED]> Reply-To: "Lucene Users List" <[EMAIL PROTECTED]> To: Lucene Users List <[EMAIL PROTECTED]> Subject: Re: Two possible

Re: Objection to using /tmp for lock files.

2003-11-13 Thread Doug Cutting
Kevin A. Burton wrote: When I first read this changelog entry: > 2. Changed file locking to place lock files in >System.getProperty("java.io.tmpdir"), where all users are >permitted to write files. This way folks can open and correctly >lock indexes which are read-only to them. I

Re: Two possible solutions on Parallel Searching

2003-11-13 Thread Doug Cutting
First, note that the approaches you describe will only improve performance if you have multiple CPUs and/or multiple disks holding the indexes. Second, MultiSearcher is currently implemented to search indexes serially, not each in a separate thread. To implement multi-threaded searching one c
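The snippet above is cut off before the suggested implementation, so the following is only a generic sketch of searching several indexes concurrently and merging the results, not Doug's proposal and not Lucene API. Shard, Hit and searchAll are hypothetical names; each Shard stands in for one searchable index. As the message says, this can only help with multiple CPUs and/or multiple disks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-ins for "one index" and "one scored result".
class ParallelSearchSketch {
    static final class Hit {
        final String docId;
        final double score;
        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }
    }

    interface Shard {
        List<Hit> search(String query);
    }

    // Query every shard in its own thread, then merge and keep the best topN.
    static List<Hit> searchAll(List<Shard> shards, String query, int topN) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Future<List<Hit>>> futures = new ArrayList<>();
            for (Shard shard : shards) {
                Callable<List<Hit>> task = () -> shard.search(query);
                futures.add(pool.submit(task));
            }
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> future : futures) {
                merged.addAll(future.get()); // blocks until that shard is done
            }
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Shard first = q -> Arrays.asList(new Hit("a", 0.9), new Hit("b", 0.2));
        Shard second = q -> Arrays.asList(new Hit("c", 0.5));
        for (Hit hit : searchAll(Arrays.asList(first, second), "w1 OR w2", 2)) {
            System.out.println(hit.docId + " " + hit.score);
        }
    }
}
```

Scores from separately built indexes are not necessarily comparable, so a real merge step would need more care than the plain sort used here.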

Two possible solutions on Parallel Searching

2003-11-13 Thread Jie Yang
I had a thought on my earlier post on "Poor Performance when searching for 500+ terms". The problem is on how to improve the performance when searching for 500+ OR search terms. i.e. enter a search string of : W1 OR W2 OR W3 OR .. OR w500. I thought I could rewrite the MultiSearcher class s

Re: Poor Performance when searching for 500+ terms

2003-11-13 Thread Otis Gospodnetic
> I am not using RAMDirectory due to the large size of > index file. the index generated on hard disc is 1.57G > for 1 million documents, each document has average 500 > terms. I am using Field.UnStored(fieldName, terms), so > I believe I am not storing the documents, just the > index. (is that rig

RE: Reopen IndexWriter after delete?

2003-11-13 Thread Otis Gospodnetic
I suggest checking the list archive. Doug has explained the reasons behind the current design several times. Otis --- "Wilton, Reece" <[EMAIL PROTECTED]> wrote: > I agree it's a bit of a strange design. > > It seems that there should be one class that handles all > modifications > of the index

Re: Reopen IndexWriter after delete?

2003-11-13 Thread Otis Gospodnetic
Because Lucene has to first find the segment that the specified document is in, and this is done via IndexReaders, not IndexWriters. More about this in the Lucene book. Otis --- Dror Matalon <[EMAIL PROTECTED]> wrote: > Which begs the question: why do you need to use an IndexReader rather > tha

Re: Latent Semantic Indexing

2003-11-13 Thread Otis Gospodnetic
No, sorry. Otis --- Ralf Bierig <[EMAIL PROTECTED]> wrote: > Does Lucene implement Latent Semantic Indexing? Examples? > > Ralf

Re: Vector Space Model in Lucene?

2003-11-13 Thread Otis Gospodnetic
Lucene does not implement a vector space model. Otis --- [EMAIL PROTECTED] wrote: > Hi, > > does Lucene implement a Vector Space Model? If yes, does anybody have > an > example of how to use it? > > Cheers, > Ralf

Re: Poor Performance when searching for 500+ terms

2003-11-13 Thread Jie Yang
Thanks Julien, I am not using RAMDirectory due to the large size of index file. the index generated on hard disc is 1.57G for 1 million documents, each document has average 500 terms. I am using Field.UnStored(fieldName, terms), so I believe I am not storing the documents, just the index. (is that

Re: fuzzy searches

2003-11-13 Thread petite_abeille
On Nov 11, 2003, at 21:02, Bruce Ritchie wrote: Just a note the LSI is encumbered by US patents 4,839,853 and 5,301,109. It would be wise to make sure that any implementation is either blessed by the patent holders or does not infringe on the patents. Since when did developers turn into armchai

Re: Can use Lucene be used for this

2003-11-13 Thread Erik Hatcher
On Thursday, November 13, 2003, at 07:16 AM, Hackl, Rene wrote: Yes and yes. Users range from Information Professionals to "naive" end users. If there's a string like "N-(t-Butyl)-N-(3,5-dinitrobenzoyl)-nitroxyl" users can be expected to search for "dinitro", "3,5-dinitro", "nitrobenz" etc. Each

Re: fuzzy searches

2003-11-13 Thread petite_abeille
On Nov 13, 2003, at 15:09, Thomas Krämer wrote: i am not familiar with intellectual property law, but it sounds somewhat strange to me, that it is possible to patent an abstract idea of how to extract information from data. The process of "Spreading Cream Cheese On Bagels" (C) (R) (TM) has been

Re: fuzzy searches

2003-11-13 Thread Thomas Krämer
Hi Bruce, i am not familiar with intellectual property law, but it sounds somewhat strange to me, that it is possible to patent an abstract idea of how to extract information from data. i can understand, that it is forbidden to reuse/modify sourcecode of a given implementation of lsi, but why s

RE: Can use Lucene be used for this

2003-11-13 Thread Chong, Herb
i suggest that you use a special tokenizer that breaks chemical names into their constituent parts and index them as if they were words. Herb -----Original Message----- From: Hackl, Rene [mailto:[EMAIL PROTECTED] Sent: Thursday, November 13, 2003 7:17 AM To: 'Lucene Users List' Subject: Re:
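A chemistry-aware tokenizer is beyond a short example, so the sketch below swaps in a simpler, different technique: indexing overlapping character trigrams so that fragments such as "foo" or "of-13" can be found anywhere inside a long name. It is plain Java with hypothetical names (NGramSketch, grams, add, containing), not a Lucene Analyzer, and it trades index size for infix matching.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy in-memory trigram index for substring ("*foo*") lookups.
class NGramSketch {
    static final int N = 3;
    private final List<String> docs = new ArrayList<>();
    private final Map<String, Set<Integer>> gramToDocs = new HashMap<>();

    // All overlapping N-character windows of the lower-cased text.
    static Set<String> grams(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i + N <= s.length(); i++) {
            out.add(s.substring(i, i + N));
        }
        return out;
    }

    void add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String gram : grams(text)) {
            gramToDocs.computeIfAbsent(gram, k -> new HashSet<>()).add(id);
        }
    }

    // Documents containing the fragment anywhere, in insertion order.
    List<String> containing(String fragment) {
        String needle = fragment.toLowerCase(Locale.ROOT);
        Set<Integer> candidates = null;
        for (String gram : grams(needle)) {
            Set<Integer> ids = gramToDocs.getOrDefault(gram, Collections.<Integer>emptySet());
            if (candidates == null) {
                candidates = new TreeSet<>(ids);
            } else {
                candidates.retainAll(ids);
            }
        }
        List<String> hits = new ArrayList<>();
        if (candidates == null) {
            // Fragment shorter than N: no grams to look up, so scan everything.
            for (String doc : docs) {
                if (doc.toLowerCase(Locale.ROOT).contains(needle)) hits.add(doc);
            }
            return hits;
        }
        for (int id : candidates) {
            // Sharing all grams does not prove adjacency, so verify the match.
            if (docs.get(id).toLowerCase(Locale.ROOT).contains(needle)) hits.add(docs.get(id));
        }
        return hits;
    }
}
```

For example, after adding "1-foo-bar" and "rab-oof-13-foonyl-naphthalene", a lookup for "foo" returns both names and a lookup for "of-13" returns only the second.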

Re: Can use Lucene be used for this

2003-11-13 Thread Hackl, Rene
>> documents contain very long strings for chemical substances, users are >> interested in certain parts of the string e.g. find all documents that >> comprise "*foo*" be it "1-foo-bar" or "rab-oof-13-foonyl-naphthalene"). > So you're saying you want users to be able to search for "of-13" and > m

Re: QueryParser Rules article (Erik Hatcher)

2003-11-13 Thread Erik Hatcher
On Wednesday, November 12, 2003, at 11:52 PM, Tomcat Programmer wrote: When using the QueryParser class, the parse method will throw a TokenMgrError when there is a syntax error even as simple as a missing quote at the end of a phrase query. According to the javadoc, you should never see this clas

Re: Poor Performance when searching for 500+ terms

2003-11-13 Thread Julien Nioche
Hello, Since there are a lot of Term objects in your Query, your application must spend a lot of time collecting information about those Terms. 1/ Do you use RAMDirectory? Loading the whole Directory into memory will increase speed - your index must not be too big though 2/ You are probably not

Re: Can use Lucene be used for this

2003-11-13 Thread Erik Hatcher
On Thursday, November 13, 2003, at 03:22 AM, Hackl, Rene wrote: documents contain very long strings for chemical substances, users are interested in certain parts of the string e.g. find all documents that comprise "*foo*" be it "1-foo-bar" or "rab-oof-13-foonyl-naphthalene"). So you're saying you

Re: Can use Lucene be used for this

2003-11-13 Thread Hackl, Rene
> If you can figure out how to tell Lucene what the parts of strings are > when you create the index, it should be easy to do this. Well, sometimes different kinds of brackets, hyphens and punctuation marks would inherently belong to strings, sometimes not. The whole collection as such is ra

Re: Can use Lucene be used for this

2003-11-13 Thread Dror Matalon
On Thu, Nov 13, 2003 at 09:22:57AM +0100, Hackl, Rene wrote: > Hi John, > > Indeed, the RCO index is ok for prefix-style wildcards. But it doesn't work > for _simultaneous_ left and right truncation ("*oba*"). I have no idea about > how often this kind of search is actually employed, but in this p

Re: Can use Lucene be used for this

2003-11-13 Thread Hackl, Rene
Hi John, Indeed, the RCO index is ok for prefix-style wildcards. But it doesn't work for _simultaneous_ left and right truncation ("*oba*"). I have no idea about how often this kind of search is actually employed, but in this particular context it is really needed (I sketched this before on this l