RE: Indexing In Lucene

2006-10-04 Thread sachin
I am working on Lucene... I have prepared a document about in-depth indexing. Unfortunately I can't attach it to the mail due to a site constraint, but I can send it to your personal email. --- Sachin -Original Message- From: Ajani, Akil (Cognizant) [mailto:[EMAIL PROTECTED] Sent:

Re: discontinuous range query

2006-10-04 Thread Doron Cohen
> > : The query you want is > : name:[A TO C] name:[G TO K] > : (each clause being SHOULD, or put another way, an implicit "OR" in between. > : > : The problem may be how you analyze the name field... is it tokenized at all? > : If so, you might be matching on first, last, and middle names, and the

Fwd: Re[2]: Fwd: Re[2]: 30 million+ docs on a single server

2006-10-04 Thread Artem Vasiliev
Hello Otis & all, I benchmarked it only subjectively - typical FieldCache'ing sort was an overkill for my humble server (now I give sharehound about 200M RAM for 14mln index which takes about 12G on disk). When sorting (using FieldCache) the first time after index change Lucene takes the whole in

Re: discontinuous range query

2006-10-04 Thread Chris Hostetter
: The query you want is : name:[A TO C] name:[G TO K] : (each clause being SHOULD, or put another way, an implicit "OR" in between). : : The problem may be how you analyze the name field... is it tokenized at all? : If so, you might be matching on first, last, and middle names, and the : combinatio

Advantage of putting lucene index in RDBMS

2006-10-04 Thread Mag Gam
I have been reading the lists for a couple of weeks now, and I noticed people asking about placing their indexes into an RDBMS. What is the advantage of that? So far Lucene has been able to solve all my problems, but I am curious how else people are using it (especially with an RDBMS). TIA

Find if words are in the same phrase?

2006-10-04 Thread Michael Imbeault
Is it possible with Lucene to limit a proximity query to a phrase to determine if two words are in the same phrase? Along the same train of thought, is it possible to determine if two words in the same phrase are separated by a word, or a list of words? Like for example Virus (some other words)

Re: discontinuous range query

2006-10-04 Thread Yonik Seeley
Hi Tom, The query you want is name:[A TO C] name:[G TO K] (each clause being SHOULD, or put another way, an implicit "OR" in between). The problem may be how you analyze the name field... is it tokenized at all? If so, you might be matching on first, last, and middle names, and the combination of
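To make the SHOULD versus MUST distinction in this thread concrete, here is a minimal plain-Java sketch. It does not use the Lucene API; `RangeClauses`, `Range`, `matchesAny` and `matchesAll` are hypothetical names for illustration. It assumes a single untokenized term per name and that terms compare like Java strings.

```java
import java.util.Arrays;
import java.util.List;

public class RangeClauses {
    /** One inclusive range over terms, in the spirit of name:[A TO C]. Hypothetical helper. */
    public static final class Range {
        final String lower;
        final String upper;

        public Range(String lower, String upper) {
            this.lower = lower;
            this.upper = upper;
        }

        /** Assumption: terms are ordered the way String.compareTo orders them. */
        boolean contains(String term) {
            return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
        }
    }

    /** SHOULD semantics: the term matches if it falls in at least one range. */
    public static boolean matchesAny(String term, List<Range> ranges) {
        return ranges.stream().anyMatch(r -> r.contains(term));
    }

    /** MUST semantics: the term has to fall in every range at once. */
    public static boolean matchesAll(String term, List<Range> ranges) {
        return ranges.stream().allMatch(r -> r.contains(term));
    }

    public static void main(String[] args) {
        List<Range> ranges = Arrays.asList(new Range("A", "C"), new Range("G", "K"));
        System.out.println(matchesAny("Baker", ranges)); // true: inside [A TO C]
        System.out.println(matchesAny("Hill", ranges));  // true: inside [G TO K]
        System.out.println(matchesAll("Baker", ranges)); // false: not inside [G TO K]
        // "Cohen" sorts after "C", so the inclusive range [A TO C] does not cover it.
        System.out.println(matchesAny("Cohen", ranges)); // false
    }
}
```

With two disjoint ranges, no single term can satisfy both, which is consistent with getting no hits when every clause is MUST. Under the string-ordering assumption, covering all names that start with C would need a wider upper bound than "C".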

Re: Number Proximity Query

2006-10-04 Thread Chris Hostetter
: Another quick question on the score. If my custom Query is returning a score : that can be any value, and this custom Query is being used together with : other standard Query in a BooleanQuery, how do I ensure the value returned by : the custom Query doesn't 'overshadow' the values returned by other

RE: Spam filter for lucene project

2006-10-04 Thread Bruce Ritchie
Rajiv, You may want to take a look at http://akismet.com/development/ - I don't believe it's open source but it may be worth looking into. Regards, Bruce Ritchie > -Original Message- > From: Rajiv Roopan [mailto:[EMAIL PROTECTED] > Sent: Wednesday, October 04, 2006 4:32 PM > To: ja

Re: Spam filter for lucene project

2006-10-04 Thread Doron Cohen
> I was wondering if anyone knows of an open source > spam filter that I can add to my project to scan > the posts (which are just plain text) for spam? I am not aware of any (which does not mean there is none), but just wanted to draw your attention to a related discussion http://www.nabble.com/

discontinuous range query

2006-10-04 Thread Tom Hill
Hi - I'm having a bit of trouble building a query to match a range of values in a field that is not continuous. For an example, say I want to find all people with last names starting with A-C, and G-K. If I use MUST on each element of the range, then I get nothing. This I think I understan

Spam filter for lucene project

2006-10-04 Thread Rajiv Roopan
Hello, I'm currently running a site which allows users to post. Lately posts have been getting out of hand. I was wondering if anyone knows of an open source spam filter that I can add to my project to scan the posts (which are just plain text) for spam? thanks in advance. Rajiv

Re: Number Proximity Query

2006-10-04 Thread KEGan
Chris, thanks again for your reply. Really appreciate your help. Another quick question on the score. If my custom Query is returning a score that can be any value, and this custom Query is being used together with other standard Query in a BooleanQuery, how do I ensure the value returned by the cu

Re: Number Proximity Query

2006-10-04 Thread Chris Hostetter
: (1) Should values returned by DocValues (returned from ValueSource) : always be between 1.0 and 0.0? How does this value affect the overall document : scores, assuming there are other Query clauses as well that are performed on : the document (on other fields). The "values" returned by the various

Re: native Java DB (eg, Derby) to store the index: performance comparison?..

2006-10-04 Thread Aleksei Valikov
Hi. I've been wondering if anyone has tried to compare the performance of any 'native' Java DB as index storage mechanism vs Lucene custom implementation? I'm assuming that DB products should provide some functionality for 'free' right out of the box (correct me if I'm wrong): - easily manageabl

Re: Searching documents on big index by using ParallelMultiSearcher is slow...

2006-10-04 Thread Simon Wistow
On Wed, Oct 04, 2006 at 01:55:06PM +, eks dev said: > have you considered hadoop "light" messaging RPC, should have > significantly smaller latencies than RMI Yes, it's one of the things I'm looking at.

Re: Number Proximity Query

2006-10-04 Thread KEGan
Erick, thanks for your reply. I have the LIA. But sorting is not the solution I am looking for. If I sort, I will lose the relevancy from searches of other fields. I want the number proximity to be one of the many fields that are searched. So the "num" field will contribute to the ov

Re: Searching documents on big index by using ParallelMultiSearcher is slow...

2006-10-04 Thread eks dev
have you considered hadoop "light" messaging RPC, should have significantly smaller latencies than RMI - Original Message From: Simon Wistow <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, 4 October, 2006 3:26:38 PM Subject: Re: Searching documents on big index by u

Re: Searching documents on big index by using ParallelMultiSearcher is slow...

2006-10-04 Thread Scott
> Prelimary experimentation with a RemoteSearch/ParallelMultiSearcher > combination found that there were issues with the RMI causing > significant blocking. > > I'm currently playing around with trying alternative messaging > approaches so that I can also load balance requests. Wow, it is very i

Re: Searching documents on big index by using ParallelMultiSearcher is slow...

2006-10-04 Thread Simon Wistow
On Wed, Oct 04, 2006 at 08:14:38AM -0400, Haines, Ronald C. (LNG-DAY) said: > I too am interested in learning more about a large scale distributed > Lucene model. I'm also building a large scale (billions of documents) Lucene index. Preliminary experimentation with a RemoteSearch/ParallelMultiSear

Re: Searching documents on big index by using ParallelMultiSearcher is slow...

2006-10-04 Thread Scott
Indeed, I am using a somewhat complex Query (4 fields with OR). My index has the fields Title, Sub-title, Content, Author, and I search them with one query, as a web search engine does. Thank you for details about weight. So I need to avoid remote calls to rewrite() and docFreq(). I'll try to make Hits object

Re: Searching documents on big index by using ParallelMultiSearcher is slow...

2006-10-04 Thread Scott
My index grows periodically. Currently it takes 1 sec for 10G of indexes. I am worried about future response times for 20G, 30G, ... and 50G indexes. I'll try remote Hits (result set) object and the SearchMaster merges top N of them. Thank you. Erick Erickson wrote: OK, you're now officially bey

RE: Searching documents on big index by using ParallelMultiSearcher is slow...

2006-10-04 Thread Haines, Ronald C. (LNG-DAY)
Keep in mind, that depending on your queries (lots of terms, wildcards, date ranges), you can spend quite a bit of time during the 'Weight' calculation...this all happens pre-search. During the Weight calculation, you will be making remote calls to the rewrite() and docFreq() methods. There will

Re: Sudden FileNotFoundException

2006-10-04 Thread Hes Siemelink
One helpful thing to do is call IndexWriter.setInfoStream(...) and save the resulting output. This prints details about which segments were merged, and what the merged segment name is. This might provide some useful details; for example, was your deleted segments file one that was just merged away

Re: Sudden FileNotFoundException

2006-10-04 Thread Michael McCandless
Hes Siemelink wrote: > It happens from time to time... but I don't know how to reproduce it. > > Rebuilding this particular index unfortunately takes about 10 hrs, so it's > not feasible to delete the index and rebuild it when this happens... our > users would be missing a lot of search result

Re: Search in HTML code

2006-10-04 Thread Erick Erickson
Don't interpret my responses as *recommending* a database, since I don't know much about your problem space. It may or may not be the right choice. Mostly, I was thinking that your particular use of Lucene as stated wasn't playing to Lucene's strengths. It may well be that Lucene is a fine choice

Re: Searching documents on big index by using ParallelMultiSearcher is slow...

2006-10-04 Thread Erick Erickson
OK, you're now officially beyond my competence, so I'll have to wait for people who actually know. Although if I read your stats right, you're getting approximately 1 sec response time over 10M documents on a 10G index. That's not bad at all. What kind of response time do you need? On 10/3/0

Re: Number Proximity Query

2006-10-04 Thread Erick Erickson
Sorry if this is a re-post, but I got an "undeliverable" error last time I tried to post it, something about SPAM. The nerve of that filter! I don't have my book handy, but you might want to check out "Lucene In Action". There's an example of how to create an index of restaurants

Re: Sudden FileNotFoundException

2006-10-04 Thread Hes Siemelink
Hi Mike, thank you for your detailed reply. I put my answers inline. On 10/4/06, Michael McCandless <[EMAIL PROTECTED]> wrote: Hes Siemelink wrote: > It happens from time to time... but I don't know how to reproduce it. > > Rebuilding this particular index unfortunately takes about 10 hrs, so

Re: QueryParser syntax French Operator

2006-10-04 Thread Patrick Turcotte
I've started to look into this (and the whole JavaCC syntax). I'll keep you posted on my results. Patrick Erik Hatcher wrote: Currently AND/OR/NOT are hardcoded into the .jj file. A patch to make this configurable would be welcome! Erik On Oct 3, 2006, at 11:15 AM, Patrick Turcotte w

Re: Sudden FileNotFoundException

2006-10-04 Thread Michael McCandless
Hes Siemelink wrote: It happens from time to time... but I don't know how to reproduce it. Rebuilding this particular index unfortunately takes about 10 hrs, so it's not feasible to delete the index and rebuild it when this happens... our users would be missing a lot of search results then! The

Re: QueryParser syntax French Operator

2006-10-04 Thread Erik Hatcher
On Oct 4, 2006, at 2:18 AM, Ronnie Kolehmainen wrote: Wouldn't the easiest fix be to just alter the users query string before passing it to queryparser (moving the semantics of your search app outside of lucene)? Something like: str.replaceAll(" ET ", " AND ").replaceAll(" OU ", " OR ").
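The pre-processing idea quoted in this message can be sketched as a small standalone helper. This is only an illustration, not Lucene code: `FrenchOperators` and `translate` are hypothetical names, and the sketch swaps the quoted space-delimited `replaceAll` calls for word-boundary patterns so that operators at the start or end of the query are also caught. It covers only the ET and OU mappings visible in the quoted text.

```java
import java.util.regex.Pattern;

public class FrenchOperators {
    // Match the operators only as whole, upper-case words, so that
    // ordinary words such as "ETAT" or lower-case "et" are left alone.
    private static final Pattern ET = Pattern.compile("\\bET\\b");
    private static final Pattern OU = Pattern.compile("\\bOU\\b");

    /** Rewrites French operators to the AND/OR keywords before the query is parsed. */
    public static String translate(String userQuery) {
        String withAnd = ET.matcher(userQuery).replaceAll("AND");
        return OU.matcher(withAnd).replaceAll("OR");
    }

    public static void main(String[] args) {
        // prints "chat AND chien OR oiseau"
        System.out.println(translate("chat ET chien OU oiseau"));
    }
}
```

One known limitation of this kind of string rewriting: it also touches ET and OU inside quoted phrases, which a change to the parser grammar itself would avoid.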

Re: Sudden FileNotFoundException

2006-10-04 Thread Hes Siemelink
It happens from time to time... but I don't know how to reproduce it. Rebuilding this particular index unfortunately takes about 10 hrs, so it's not feasible to delete the index and rebuild it when this happens... our users would be missing a lot of search results then! There are a couple of wor

Re: Sudden FileNotFoundException

2006-10-04 Thread Karel Tejnora
I once had the same problem and, judging by Jira, I am not alone. I deleted the index and rebuilt it from source, and the problem was gone. I'm unable to reproduce it. Are you able to reproduce the problem? Karel java.io.FileNotFoundException: /lucene-indexes/mediafragments/_8km.fnm (No ---

Re: Search in HTML code

2006-10-04 Thread John Bugger
Thanks, Erick! I'll try using a LIKE query against the database.

Sudden FileNotFoundException

2006-10-04 Thread Hes Siemelink
Hi all, I'm having trouble with a FileNotFoundException that pops up every once in a while. Everything works fine in my application (description below), but after running for some time (e.g. 20 hours) an exception like this one may occur: java.io.FileNotFoundException: /lucene-indexes/mediafragment

Re: Number Proximity Query

2006-10-04 Thread KEGan
Thanks Chris. After spending half a day to "really" look into FunctionQuery (and related classes), and re-reading about Weight and Scorer, I think I am beginning to understand a bit. But more questions. (1) Should values returned by DocValues (returned from ValueSource) always be between 1.0 and 0