RE: How to avoid score calculation completely?

2007-05-24 Thread Ramana Jelda
But I also see the importance of ignoring score calculation. Performance gain aside, is there any possibility to skip the scoring calculation completely? Jelda

Re: WhitespaceAnalyzer [was: Re: regaridng Reader.terms()]

2007-05-24 Thread Chris Hostetter
: return !Character.isWhitespace(c); : And my class override that method as this: : return !((int)c==32); in my opinion that's a pretty naive change ... it won't split on tab characters or newlines ... even for trivial ASCII text that's probably not what you want. : I think the Charact
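Hoss's point can be seen in a small sketch. Assuming the class in question extends Lucene's CharTokenizer (as the stock WhitespaceTokenizer does), keeping Character.isWhitespace splits on tabs and newlines as well as spaces, while the `==32` version does not:

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

// Sketch: a tokenizer that splits on ANY whitespace (space, tab, newline...),
// which is exactly what stock WhitespaceTokenizer already does. Changing the
// test to `(int) c == 32` would glue "a\tb" into one token.
public class AnyWhitespaceAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new CharTokenizer(reader) {
            protected boolean isTokenChar(char c) {
                return !Character.isWhitespace(c); // keep non-whitespace chars
            }
        };
    }
}
```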

Re: Lock obtain timed exception

2007-05-24 Thread Michael McCandless
"Laxmilal Menaria" <[EMAIL PROTECTED]> wrote: > > I am getting Lock obtain timed exception while Searching in Index. > > > > My Steps: I have created a Lucene Index at first week of may 2007, after > > that I have nothing changed in index folder. Just I am searching. Searcher > > code have only M

Searching on a Rapidly changing Index

2007-05-24 Thread Simon Wistow
I've built a Lucene system that gets rapidly updated - documents are supposed to be searchable immediately after they've been indexed. As such I have a Writer that puts new index, update and delete tasks into a queue and then has a thread which consumes them and applies them to the index using

Re: WITH_POSITIONS_OFFSETS versus WITH_OFFSETS

2007-05-24 Thread Grant Ingersoll
WITH_OFFSETS gives the equivalent of Token.startOffset and Token.endOffset information which is the actual offset in the String (although it can be manipulated), while WITH_POSITIONS gives the position information (which can also be manipulated). Position info tells where the token occurs

Re: Lock obtain timed exception

2007-05-24 Thread Laxmilal Menaria
yes, I am getting the JVM crash exception in logs. # # An unexpected error has been detected by Java Runtime Environment: # # java.lang.OutOfMemoryError: requested 32756 bytes for ChunkPool::allocate. Out of swap space? # # Internal Error (414C4C4F434154494F4E0E4350500065), pid=25596, tid=90152

Lucene code injection?

2007-05-24 Thread Joe
Hi, I indexed emails. Now I want to restrict the search functionality so that users can only search for emails to/from them. I know the email address of the user, so my plan is to do it in the following way: the user enters some search parameters, and they are combined in a query. This is a mi

Re: Lucene code injection?

2007-05-24 Thread Joe
Hi, This sounds good. As for the code injection it is up to you to sanitize the request before it goes to lucene, probably by filling the email field yourself and not relying on the user input for the email address. I hoped I wouldn't have to sanitize the user input because the email address query is ANDed

RE: Lucene code injection?

2007-05-24 Thread Daan de Wit
Hi Joe, It might be possible when you append the restriction before parsing the user query with the QueryParser, but I'm not sure. I recommend first parsing the query, and then constructing a BooleanQuery with the parsed user query and the e-mail term both as must. Another approach would be to use
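Daan's suggestion might look like the following sketch. Field names "content" and "to" are my own illustration, not from the thread; `userInput` and `userEmail` stand for the untrusted query string and the trusted address:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Parse the untrusted user text first, then AND it with a TermQuery built
// in code -- the user's text can never escape its own clause this way.
Query userQuery = new QueryParser("content", new StandardAnalyzer())
        .parse(userInput);                 // may throw ParseException

BooleanQuery restricted = new BooleanQuery();
restricted.add(userQuery, BooleanClause.Occur.MUST);
restricted.add(new TermQuery(new Term("to", userEmail)),
               BooleanClause.Occur.MUST); // trusted, built from the session
```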

Re: HitCollector or Hits

2007-05-24 Thread Erick Erickson
I know of no way to alter the Hits behavior, I recommend using a TopDocs/TopDocCollector. But be aware that if you load the document for each one, you may incur a significant penalty, although the lazy-loading helped me a lot, see FieldSelector. On 5/23/07, Carlos Pita <[EMAIL PROTECTED]> wr
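A sketch of Erick's suggestion, assuming the TopDocCollector class available in the Lucene version under discussion (the index path and the 100-hit limit are illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.search.TopDocs;

// Collect only the top N hits, then load stored fields just for the docs
// actually displayed -- document loading is the expensive step.
IndexSearcher searcher = new IndexSearcher("/path/to/index");
TopDocCollector collector = new TopDocCollector(100);
searcher.search(query, collector);
TopDocs top = collector.topDocs();
for (int i = 0; i < top.scoreDocs.length; i++) {
    ScoreDoc sd = top.scoreDocs[i];
    Document d = searcher.doc(sd.doc); // a lazy FieldSelector can cut this cost
}
```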

RE: Lucene code injection?

2007-05-24 Thread Damien McCarthy
Hi Joe, It would probably be cleaner to use a QueryFilter rather than doing the AND. Take a look at http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/QueryFilter.html Also I'm not sure that using the sent-to field will work - people may receive email from a list, such as this, whe
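A minimal sketch of the QueryFilter approach (the field name "to" is illustrative):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.TermQuery;

// The filter constrains the document space outside of scoring, and its
// bitset is cached per reader, so repeat searches by the same user are cheap.
QueryFilter mailFilter =
        new QueryFilter(new TermQuery(new Term("to", userEmail)));
Hits hits = searcher.search(userQuery, mailFilter);
```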

RE: Searching on a Rapidly changing Index

2007-05-24 Thread Mordo, Aviran (EXP N-NANNATEK)
You can create two indexes. One will be for new documents, let's say the last 24 hours, and another one for older documents. This way you will only update a small portion of your index while the large index will remain relatively constant, so you don't have to get a new searcher for it. HTH Aviran ht

RE: Lucene code injection?

2007-05-24 Thread Mordo, Aviran (EXP N-NANNATEK)
This sounds good. As for the code injection it is up to you to sanitize the request before it goes to lucene, probably by filling the email field yourself and not rely on the user input for the email address. HTH Aviran http://www.aviransplace.com http://shaveh.co.il -Original Message-

Re: Searching on a Rapidly changing Index

2007-05-24 Thread Erick Erickson
Another option would be to only re-open your searcher when actually needed, that is after the index has changed. This only does you some good when you have some hope that there are sizable gaps in your modifications Another possibility is to relax the "immediately" constraint. Would a maximum
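Erick's first option can be sketched with IndexReader's version check (`indexPath` and the swap logic are illustrative; real server code needs synchronization around the swap):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

// Reopen only when the on-disk index version has actually advanced.
long diskVersion = IndexReader.getCurrentVersion(indexPath);
if (diskVersion != currentReader.getVersion()) {
    IndexReader newReader = IndexReader.open(indexPath);
    IndexSearcher newSearcher = new IndexSearcher(newReader);
    // publish newSearcher; close the old one after in-flight queries finish
}
```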

Re: Lucene code injection?

2007-05-24 Thread Joe
Damien McCarthy wrote: Hi Joe, It would probably be cleaner to use a QueryFilter rather than doing the AND. Take a look at http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/QueryFilter.html OK, if it's not too slow I'll go this way. Also I'm not sure that using the sent-to fiel

Re: Lucene code injection?

2007-05-24 Thread Joe
Hi, Hi Joe, It might be possible when you append the restriction before parsing the user query with the QueryParser, but I'm not sure. I recommend first parsing the query, and then constructing a BooleanQuery with the parsed user query and the e-mail term both as MUST. Yes, that's the idea. An

Re: Searching on a Rapidly changing Index

2007-05-24 Thread Simon Wistow
On Thu, May 24, 2007 at 09:28:30AM -0400, Erick Erickson said: > If that's unacceptable, you can *still* open up a new reader in the > background and warm it up before using it. "immediately" then > becomes 5-10 seconds or so. This is currently what I'm doing, using a list of previously performed qu

Re: Searching on a Rapidly changing Index

2007-05-24 Thread Joe Shaw
Hi, On 5/24/07, Erick Erickson <[EMAIL PROTECTED]> wrote: If that's unacceptable, you can *still* open up a new reader in the background and warm it up before using it. "immediately" then becomes 5-10 seconds or so. I've seen the term "warming" used a few times on the various lists. What const

Re: Searching on a Rapidly changing Index

2007-05-24 Thread Erick Erickson
Yep. You probably want to do some sorting by other than relevancy too in order to fill the sort caches. Erick On 5/24/07, Joe Shaw <[EMAIL PROTECTED]> wrote: Hi, On 5/24/07, Erick Erickson <[EMAIL PROTECTED]> wrote: > If that's unacceptable, you can *still* open up a new reader in the > b
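Pulling the thread's advice together, "warming" might look like this sketch: replay a handful of representative queries, including sorted ones, against the new searcher before publishing it (`warmupQueries`, `indexPath`, and the sort field are assumptions for illustration):

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;

// Run recent/representative queries against the fresh searcher so that
// OS file caches and Lucene's sort caches (FieldCache) are populated
// before real traffic hits it.
IndexSearcher warm = new IndexSearcher(indexPath);
Sort byStore = new Sort("store_id");          // any field you sort on
for (int i = 0; i < warmupQueries.length; i++) {
    Query q = warmupQueries[i];
    warm.search(q, null, 10);                 // relevance-ranked pass
    warm.search(q, null, 10, byStore);        // fills the sort cache
}
// only now swap `warm` in as the live searcher
```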

Re: HitCollector or Hits

2007-05-24 Thread Carlos Pita
Hi Erick, thank you for your prompt answer. What do you mean by loading the document? Accessing one of the stored fields? In that case I'm afraid I would need to do it. For example, in the aforementioned case of a result of products, I have to look at any product store_id, which is stored along t

How to search more than one word?

2007-05-24 Thread Rodrigo F Valverde
Hi all! I implemented a searcher with Lucene and I'm trying to search for two words, both in the same text file, but... I can't! When I search for the first word and the second separately, everything works OK, but together, with or without "AND" or "+"... nothing is found! :( Can somebody h

Re: How to avoid score calculation completely?

2007-05-24 Thread Yonik Seeley
On 5/24/07, Ramana Jelda <[EMAIL PROTECTED]> wrote: But I also see importance of ignoring score calculation. If you put it aside performance gain, is there any possibility to completely ignore scoring calculation? Yes, for unsorted results use a hit collector and no sorting will be done by sco
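Yonik's suggestion in sketch form: a HitCollector receives raw (doc, score) pairs in index order, nothing is sorted or normalized by score, and the collector is free to ignore the score argument entirely:

```java
import org.apache.lucene.search.HitCollector;

// Count matches without any score-based sorting. (The scorer may still
// compute a raw score internally; the follow-ups in this thread discuss
// avoiding even that.)
final int[] count = new int[1];
searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
        count[0]++; // score is ignored
    }
});
```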

Scoring on Number of Unique Terms Hit, Not Term Frequency Counts

2007-05-24 Thread Walt Stoneburner
Hi, I'm trying to figure what I need to do with Lucene to score a document higher when it has a larger number of unique search terms that are hit, rather than term frequency counts. A quick example. If I'm searching for "BIRD CAT DOG" (all should clauses), then I want ...a document with B

Re: How to avoid score calculation completely?

2007-05-24 Thread Michael McCandless
"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On 5/24/07, Ramana Jelda <[EMAIL PROTECTED]> wrote: > > But I also see importance of ignoring score calculation. > > > > If you put it aside performance gain, is there any possibility to completely > > ignore scoring calculation? > > Yes, for unsorted r

Shortest snippet in Search results

2007-05-24 Thread Prasanna Seshadri
Hello users, I am right now developing an algorithm to calculate the shortest snippet from the search results for a given keyword of length n (from the user query). From the Lucene source I found that there is a method getBestFragments which would do the same. However it's very hard to interpret

RE: How to avoid score calculation completely?

2007-05-24 Thread Zhang, Lisheng
Hi, Thanks for helps! Yes, along the line you mentioned we can reduce the amount of calculation, but we still need to loop through to count all docs, so time may still be O(n), I am wondering if we can avoid the loop to get count directly? Best regards, Lisheng -Original Message- From: M

Re: HitCollector or Hits

2007-05-24 Thread Erick Erickson
You're on the right track. But that said, access to anything that's indexed (stored or not) should be pretty quick. Things stored, but not indexed, are costlier. This might drive your decision on what to index vs. store. Loading the document means anything like IndexReader.document(), or Hits.d

Re: How to search more than one word?

2007-05-24 Thread Erick Erickson
Not until you give us more information. In particular: what analyzers you use at index and search time, what the string was originally and how you indexed it, and what query.toString() shows you. Best Erick On 5/24/07, Rodrigo F Valverde <[EMAIL PROTECTED]> wrote: Hi all! I implemented a search

RE: How to avoid score calculation completely?

2007-05-24 Thread Michael McCandless
"Zhang, Lisheng" <[EMAIL PROTECTED]> wrote: > Hi, Thanks for helps! > > Yes, along the line you mentioned we can reduce the amount > of calculation, but we still need to loop through to count > all docs, so time may still be O(n), I am wondering if we > can avoid the loop to get count directly?

maxDoc and arrays

2007-05-24 Thread Carlos Pita
Hi all, Is there any guarantee that the maxDoc returned by a reader will be about the total number of indexed documents? The motivation of this question is that I want to associate some info to each document in the index, and in order to access this additional data in O(1) I would like to do this

Re: HitCollector or Hits

2007-05-24 Thread Carlos Pita
Hi Erick, I don't think that FieldSelector would be that valuable in my case because I just need to access a few fields, and those are all fields that are in fact stored (and indexed too). I was thinking of keeping this extra information in memory, precisely into an array mapping doc ids to the d

Re: Who has sample code of remote multiple servers multiple indexes searching?

2007-05-24 Thread Su.Cheng
Hi, I found the problem. The version of Lucene on server is 2.1 while on client is 1.9. Thanks On Wed, 2007-05-23 at 13:52 -0600, Su.Cheng wrote: > Hi, > I studied "5.6 Searching across multiple Lucene indexes 178" in < in action>>. > > I have 2 remote serarch computers(SearchServer) work as

KeywordAnalyzer vs. Field.Index.UN_TOKENIZED

2007-05-24 Thread dontspamterry
Hi all, I have an ID field which I index using the KeywordAnalyzer. Since this analyzer tokenizes the entire stream as a single token, would you say the end result is the same as using any analyzer and specifying this ID field as untokenized? The latter approach does not use the analyzer so would

Re: maxDoc and arrays

2007-05-24 Thread Erick Erickson
See below... On 5/24/07, Carlos Pita <[EMAIL PROTECTED]> wrote: Hi all, Is there any guaranty that the maxDoc returned by a reader will be about the total number of indexed documents? No. It will always be at least as large as the total documents. But that will also count deleted documents

Improving Search Performance on Large Indexes

2007-05-24 Thread Scott Sellman
Hello, Currently we are attempting to optimize the search time against an index that is 26 GB in size (~35 million docs) and I was wondering what experiences others have had in similar attempts. Simple searches against the index are still fast even at 26GB, but the problem is our application

Res: How to search more than one word?

2007-05-24 Thread Rodrigo F Valverde
I will try to summarize the code:
INDEX TIME
- IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
- writer.setUseCompoundFile(false);
- while has files into the given dir...
- Document doc = new Document();
- doc.add(new Field("content", new FileReader(file)));
- doc.add(ne

Re: maxDoc and arrays

2007-05-24 Thread Carlos Pita
Why wouldn't numdocs serve? Because the document id (which is the array index) would be in the range 0 ... maxDoc and not 0 ... numDocs, wouldn't it? Cheers, Carlos Best Erick The motivation of this question is that I want to associate some info to > each document in the index, and in ord

Re: Scoring on Number of Unique Terms Hit, Not Term Frequency Counts

2007-05-24 Thread Grant Ingersoll
Have a look at the DisjunctionMaxQuery, I think it might help, although I am not sure it will fully cover your case. -Grant On May 24, 2007, at 11:22 AM, Walt Stoneburner wrote: Hi, I'm trying to figure what I need to do with Lucene to score a document higher when it has a larger number of
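For reference, the API Grant mentions looks like this. With DisjunctionMaxQuery a document's score is its best-matching clause plus tieBreakerMultiplier times the scores of the other matching clauses, which dampens (but does not eliminate) the effect of one term's high frequency; the field name is illustrative:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.TermQuery;

// score(doc) = max(clause scores) + 0.1 * sum(other matching clause scores)
DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.1f);
dmq.add(new TermQuery(new Term("body", "bird")));
dmq.add(new TermQuery(new Term("body", "cat")));
dmq.add(new TermQuery(new Term("body", "dog")));
```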

Re: maxDoc and arrays

2007-05-24 Thread Carlos Pita
No. It will always be at least as large as the total documents. But that will also count deleted documents. Do you mean that deleted document ids won't be reutilized, so the index maxDoc will grow more and more with time? Isn't there any way to compress the range? It seems strange to me, con

RE: How to filter fields with hits from result set

2007-05-24 Thread Andreas Guther
Erick, I was pursuing a different direction yesterday which is not fast enough. Basically I was using the highlighter to figure out if a page has a hit or not. But that is too expensive. I end up with 15 ms per page and that adds up. I have to allow ad-hoc queries, so it sounds like the solution

Re: Improving Search Performance on Large Indexes

2007-05-24 Thread Otis Gospodnetic
Scott, Yes, take your big index and split it into multiple smaller shards. Put those shards in different servers and then query them remotely (using the provided RMI thing in Lucene or using something custom), take top N results from each searcher, merge those, and take top N from the merged r

Re: KeywordAnalyzer vs. Field.Index.UN_TOKENIZED

2007-05-24 Thread Otis Gospodnetic
Terry, I think you are right. Just use UN_TOKENIZED, that will do what you need. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: dontspamterry <[EMAIL PROTECTED]> To: java-user@lucene.

Re: HitCollector or Hits

2007-05-24 Thread Otis Gospodnetic
Carlos, It sounds like you'll have to build logic that knows when the index has been reopened and repopulates your cache. Take a look at Solr, it does this type of stuff. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Shar

Re: maxDoc and arrays

2007-05-24 Thread Otis Gospodnetic
Carlos: Answer to your last question: No, but if you look in JIRA, Karl Wettin has written something that does have a notification mechanism that you are describing. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share --

Re: KeywordAnalyzer vs. Field.Index.UN_TOKENIZED

2007-05-24 Thread dontspamterry
Hi Otis, I tried both ways, did some queries, and results are the same, so I guess it's a matter of preference??? -Terry Otis Gospodnetic wrote: > > Terry, > I think you are right. Just use UN_TOKENIZED, that will do what you need. > > Otis > . . . . . . . . . . . . . . . . . . . . . . . .

Res: Res: How to search more than one word?

2007-05-24 Thread Rodrigo F Valverde
Hi again! That's all different now! I'm no longer using reader.search(); now I'm using the QueryParser:
- QueryParser qp = new QueryParser("content", new StandardAnalyzer());
- query = qp.parse(keyWordToSearch);
Now it works fine! :D But now I need to know the difference between them! :) T

Re: KeywordAnalyzer vs. Field.Index.UN_TOKENIZED

2007-05-24 Thread Steven Rowe
Hi Terry, The one place I know where KeywordAnalyzer is definitely useful is when it is used in conjunction with PerFieldAnalyzerWrapper. Steve dontspamterry wrote: > Hi Otis, > > I tried both ways, did some queries, and results are the same, so I guess > it's a matter of preference??? > > -Te
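A sketch of that combination: an UN_TOKENIZED field skips analysis at index time, but QueryParser still analyzes query text, so wiring in KeywordAnalyzer for the ID field keeps parsed queries from splitting or lowercasing IDs (the field name "id" is illustrative):

```java
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// StandardAnalyzer for free text, KeywordAnalyzer (whole value = one token)
// for the "id" field. Use the same wrapper for both indexing and QueryParser.
PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
analyzer.addAnalyzer("id", new KeywordAnalyzer());
```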

Re: HitCollector or Hits

2007-05-24 Thread Chris Hostetter
: just need to access a few fields, and those are all fields that are in fact : stored (and indexed too). I was thinking of keeping this extra information : in memory, precisely into an array mapping doc ids to the data structure. I if the fields you need are indexed and single valued (and untoken
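Hoss's FieldCache idea in sketch form, reusing the store_id field from earlier in the thread (variable names are illustrative):

```java
import org.apache.lucene.search.FieldCache;

// For an indexed, single-valued, untokenized field, FieldCache exposes the
// values as an array indexed by docId -- O(1) lookup, no stored-field I/O.
final String[] storeIds = FieldCache.DEFAULT.getStrings(reader, "store_id");
// then, inside HitCollector.collect(int doc, float score):
//     String storeId = storeIds[doc];
```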

Re: How to filter fields with hits from result set

2007-05-24 Thread Erick Erickson
Well, my data may not be too helpful. But some of the books I'm counting hits for are a thousand-plus pages. We haven't had performance issues, but that's only saying "no customer has complained yet". The old solution we used did something similar to what you're talking about, basically streaming

Re: maxDoc and arrays

2007-05-24 Thread Erick Erickson
Document IDs will be re-utilized, after, say, optimization. One consequence of this is that optimization will change the IDs of *existing* documents. You're right, that numdocs may well be shorter than maxdocs. That's what I get for reading quickly... Best Erick On 5/24/07, Carlos Pita <[EMAIL

Re: Improving Search Performance on Large Indexes

2007-05-24 Thread Su.Cheng
Hi Scott, I met the same situation as you (indexing 100M documents). If the computer has only one CPU and one disk, ParallelMultiSearcher is slower than MultiSearcher. I wrote an email "Who has sample code of remote multiple servers multiple indexes searching" yesterday. If you have any suggestion,

Re: maxDoc and arrays

2007-05-24 Thread Carlos Pita
That's no problem, I can regenerate my entire extra data structure upon periodic index optimization. That way the array size will be about the size of the index. What I find more difficult is to know the id of the last added/removed document. I need it to update the in-mem structure upon more fin

Re: Res: How to search more than one word?

2007-05-24 Thread Erick Erickson
If you haven't, I *strongly* recommend you get a copy of Luke; Google "lucene luke" to find it. It allows you to examine your index and also to see how queries parse. It's invaluable. I can't say exactly what the difference is, but there are several possibilities. Note that in general it's best

Re: maxDoc and arrays

2007-05-24 Thread Erick Erickson
From the Javadoc for IndexReader. Returns one greater than the largest possible document number. This may be used to, e.g., determine how big to allocate an array which will have an element for every document number in an index. Isn't that what you're wondering about? Erick On 5/24/07, Ca
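The javadoc's suggestion as a sketch (`indexPath` is illustrative; note the array must be rebuilt whenever the reader is reopened, since docIds can shift after merges or optimize):

```java
import org.apache.lucene.index.IndexReader;

// One slot per possible docId; slots for deleted docs are simply never read.
IndexReader reader = IndexReader.open(indexPath);
int[] extra = new int[reader.maxDoc()];
for (int docId = 0; docId < reader.maxDoc(); docId++) {
    if (reader.isDeleted(docId)) continue;
    // populate extra[docId] from the document or an external source
}
```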

Re: maxDoc and arrays

2007-05-24 Thread Yonik Seeley
On 5/24/07, Carlos Pita <[EMAIL PROTECTED]> wrote: I need it to update the in-mem structure upon more fine-grained index changes. Any ideas? Currently, a deleted doc is removed when the segment containing it is involved in a segment merge. A merge could be triggered on any addDocument(), mak

Re: maxDoc and arrays

2007-05-24 Thread Carlos Pita
Yes Erick, that's fine. But the fact is that I'm not sure whether the next added document will have an id equal to maxDocs. If this is guaranteed, then I will update the maxDocs slot of my extra data structure upon document addition and get rid of the hits.id(0) slot upon document deletion. Then,

Re: maxDoc and arrays

2007-05-24 Thread Yonik Seeley
On 5/24/07, Carlos Pita <[EMAIL PROTECTED]> wrote: Yes Erick, that's fine. But the fact is that I'm not sure whether the next added document will have an id equal to maxDocs. Yes. The highest docId will always be the last document added, and docIds are never re-arranged with respect to each ot

Res: Res: How to search more than one word?

2007-05-24 Thread Rodrigo F Valverde
Yes, I already have Luke! :) The words I used were "maria" and "amanda". The first word is in one text file, and the second is in that same file and another one (so, two files). Changing IndexSearcher.search() to QueryParser.parse() and keeping everything else equal, all works fine. By Luke and by

Re: maxDoc and arrays

2007-05-24 Thread Carlos Pita
I have done some benchmarks. Keeping things in an array makes the entire search, including postprocessing from first to last id for a big result set, extremely fast. So I would really like to implement this approach. But I'm concerned about what Yonik remarked. I could use a large mergeFactor but

Re: maxDoc and arrays

2007-05-24 Thread Chris Hostetter
: extremely fast. So I would really like to implement this approach. But I'm : concerned about what Yonik remarked. I could use a large mergeFactor but : anyway, just to be sure, is there a way to make the index inform my : application of merging events? this entire thread seems to be a discussio

Re: maxDoc and arrays

2007-05-24 Thread Carlos Pita
Mh, some of my fields are in fact multivalued. But anyway, I could store them as a single string and split after retrieval. Will FieldCache work for the first search with some query, or just for the successive ones, for which the fields are already cached? Cheers, Carlos On 5/24/07, Chris Hoste

Re: maxDoc and arrays

2007-05-24 Thread Chris Hostetter
: Mh, some of my fields are in fact multivaluated. But anyway, I could store : them as a single string and split after retrieval. : Will FieldCache work for the first search with some query or just for the : successive ones, for which the fields are already cached? The first time you access the ca

RE: Improving Search Performance on Large Indexes

2007-05-24 Thread Scott Sellman
Hi Su, I came across some discussion of ParallelMultiSearcher and RMI in chapter 5 of the book Lucene in Action. There are a couple of examples in there, so might be a good place to start. -Scott -Original Message- From: Su.Cheng [mailto:[EMAIL PROTECTED] Sent: Thursday, May 24, 2007

Re: maxDoc and arrays

2007-05-24 Thread Carlos Pita
Nice, I will write the ids into a byte array with a DataOutputStream and then marshal that array into a String with a UTF8 encoding. This way there is no need for parsing or splitting, and the encoding is space efficient. This marshaled String will be cached with a FieldCache. Thank you for your s

Re: Improving Search Performance on Large Indexes

2007-05-24 Thread Sharad Agarwal
Su.Cheng wrote: Hi Scott, I met the same situation as you(index 100M documents). If the computer has only one CPU and one disk, ParallelMultiSearcher is slower than MultiSearcher. I wrote an email "Who has sample code of remote multiple servers multiple indexes searching" yesterday. If you ha

How can I let my applications know the index changed, without needing to re-open the index?

2007-05-24 Thread 童小军
I have some applications that index new data into one index directory, and other applications that read the index for data mining. But my mining application must re-open the index directory. The index is 5 GB, and the mining must be real-time. How can I do it with many computers at one n

Re: How can I let my applications know the index changed, without needing to re-open the index?

2007-05-24 Thread Stephen Gray
Hi, My understanding is that once you have added documents to your index you need to close and reopen your IndexReader and Searcher, otherwise the documents added will not be available to these. You might want to try LuceneIndexAccessor (http://www.blizzy.de/lucene/lucene-indexaccess-0.1.0.zip) w

Re: maxDoc and arrays

2007-05-24 Thread Antony Bowesman
Carlos Pita wrote: Hi all, Is there any guarantee that the maxDoc returned by a reader will be about the total number of indexed documents? It struck me in this thread that there may be a misunderstanding of the relationship between numDocs/maxDoc and an IndexReader. When an IndexReade

Re: maxDoc and arrays

2007-05-24 Thread Carlos Pita
I see. Anyway I would update the array when adding a document, so my reader would be closed then, and just a writer would be accessing the index. Supposing that no merging is triggered (for this I'm choosing a big mergeFactor and forcing optimization when a number of documents has been added) the