Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-18 Thread Varun Dhussa
Hi, The details are as follows: Solaris version: Solaris 10 U5 and U6. For the Java setup, I have tried with: Sun JDK 1.5 (32 & 64 bit), Sun JDK 1.6 (32 & 64 bit). Heap space: 2G for 32 bit and 4G for 64 bit (set the same values for both Xms and Xmx). Disk: tried with ZFS (U6) and UFS (U5). I reduced the

query regarding indexing method

2009-02-18 Thread nitin gopi
Hello all, I want to know what algorithm lucene uses for indexing documents. Can I use lucene in my application with my own algorithm for indexing? regards, Nitin Gopi

Re: what's the best practice for getting "next page" of hits?

2009-02-18 Thread Ganesh
Your solution (b) is better than using your own way of paging. Run the search for every page, collect the top (pageno * count) results, discard the first ((pageno - 1) * count), and display the last count results to the user. This is fast and efficient. Regards Ganesh
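The paging arithmetic described above can be sketched in plain Java. This is only an illustration, not Lucene code: the class and method names (PagingSketch, page) are made up for the example, and a plain List stands in for the ranked results a search would return.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of Lucene.
class PagingSketch {
    /**
     * Given the top (pageNo * count) ranked results, drop the first
     * ((pageNo - 1) * count) and return what is left, i.e. the requested page.
     * Pages are numbered from 1.
     */
    static <T> List<T> page(List<T> topResults, int pageNo, int count) {
        if (pageNo < 1 || count < 1) {
            throw new IllegalArgumentException("pageNo and count must be >= 1");
        }
        // Clamp both bounds so a page past the end yields an empty list.
        int from = Math.min((pageNo - 1) * count, topResults.size());
        int to = Math.min(pageNo * count, topResults.size());
        return new ArrayList<>(topResults.subList(from, to));
    }
}
```

The caller would re-run the query for each page and ask for pageNo * count hits, so nothing has to be cached between requests; the cost is that deep pages collect more hits than they display.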

what's the best practice for getting "next page" of hits?

2009-02-18 Thread rolarenfan
R2.4 So, I may well be missing something here, but: I use IndexSearcher.search(someQuery, null, count, new Sort()); to get an instance of TopFieldDocs (the "Hits" is deprecated). So far, all fine; I get a bunch of documents. Now, what is the Lucene-best-practice for getting the *next* batch

Re: newbie seeking explanation of semantics of "Field" class

2009-02-18 Thread Otis Gospodnetic
Or:
// store and index this field to allow original field content retrieval and search against it
myDocument.add(new Field("contents", theFullDocumetText, Field.Store.COMPRESS, Field.Index.ANALYZED));
Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: newbie seeking explanation of semantics of "Field" class

2009-02-18 Thread rolarenfan
Thanks to Erick, Matthew, and Uwe -- that does help, a lot. E.g., one bit of code I had (mostly copied) now makes more sense:
// add this field, to allow retrieving the full-text:
myDocument.add(new Field("contents", theFullDocumetText, Field.Store.COMPRESS, Field.Index.NO));
// add this fiel

Re: TopDocCollector vs Hits: TopDocCollector slowing....

2009-02-18 Thread AlexElba
Grant Ingersoll-6 wrote: > > I presume they are both now slower, right? Otherwise you wouldn't > mind the speedup on the bigger one. Hits did caching and prefetched > things, which has its tradeoffs. Can you describe how you were > measuring the queries? How many results were you get

Re: Lucene 2.3-2.4 switch: Scoring change

2009-02-18 Thread AlexElba
AlexElba wrote: > > Hello, > I have a project which I am trying to switch from lucene 2.3.2 to 2.4. I am > getting some strange scores > > Before, my code was: > > Hits hits = searcher.search(query); > Float score = hits.score(1) > > and scores from hits were from 0 to 1; 1 was a 100% match > > I chan

RE: Hebrew and Hindi analyzers

2009-02-18 Thread Zhang, Lisheng
Thanks very much for the help! -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Tuesday, February 17, 2009 9:48 PM To: java-user@lucene.apache.org Subject: Re: Hebrew and Hindi analyzers hey i've played around with trying to get towards a reasonable gpl hebrew analyzer f

Re: the impact of thousands of field in a single document

2009-02-18 Thread Yonik Seeley
On Wed, Feb 18, 2009 at 3:26 AM, wrote: > Due to requirement, we need to construct a Lucene document with tens of > thousands of Field. Did anyone try this? What's the performance penalty > comparing with one single field to store all tokens for both indexing > and searching? It's doable. Search

Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-18 Thread Michael Stoppelman
Fuzzy search tends to be super heavy on CPU because of the Levenshtein distance algorithm. We use it for a small index (60MB) for spell correcting and our QPS suffers as a result. There was recently a discussion of a new fuzzy algorithm: https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian

Re: stream of events never to know when it ends? how to index such things & search

2009-02-18 Thread Erick Erickson
You could always sort by EVENTID, that way at least you'd have all the events for a particular ID together in your results. You'd have to post-filter the results to determine whether all the necessary descriptions were present. But I don't think this works all that well because, as you pointed out,

Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-18 Thread Glen Newton
Could you give some configuration details: - Solaris version - Java VM version, heap size, and any other flags - disk setup You should also consider using huge pages (see http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html) I will also be posting performance gains using

Change the boosting in search-time

2009-02-18 Thread Haroldo Nascimento
Hi, I have a request in my search project that I do not know if it is possible to do easily: if an exact match occurs (using PhraseQuery, for example), I must add to the boost of this field the value of another field of the document (for example, price), but if a partial match occurs I must

Re: Identify the fields with matching only

2009-02-18 Thread Grant Ingersoll
You can use the explain() method or you can use the Highlighter, but neither is perfect in this regard. You can also look into using SpanQueries, which give you positional information about where matches take place. This would require you to switch how you generate queries. There is als

Lucene Boot Camp Training at ApacheCon Europe

2009-02-18 Thread Grant Ingersoll
Hi Lucene Users, For those who don't already know, I will be offering a two day Lucene Boot Camp training at ApacheCon Europe on March 23 and 24. The two day class covers a lot of detail on how to use Lucene to build search applications, including the basics of searching, indexing and an

Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-18 Thread eks dev
Have you tried NGram SpellChecker + Query expansion? This is quite similar to your proposal, you have your priority queue in SpellChecker - Original Message > From: mark harwood > To: java-user@lucene.apache.org > Sent: Wednesday, 18 February, 2009 11:54:18 > Subject: Re: Lucene sear

stream of events never to know when it ends? how to index such things & search

2009-02-18 Thread Christian Brennsteiner
dear lucene community, i am playing around with lucene right now and have come to a very bad problem. given environment: a signal source gives signals with eventids and eventdescriptions, for example EVENTID=1 and EVENTDESCRIPTION="STARTING EVENT". those events can be running very long (e.g. one

Re: Phrase indexing and searching with Lucene

2009-02-18 Thread Erick Erickson
I'm still not clear why the built-in phrase query syntax won't work. If I index the following terms (erick, erickson, thinks, small, thoughts) in a single field, then searching for "erick erickson" (as a phrase query, i.e. with double quotes when sent through a query parser or constructing a Phrase
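To make the point about indexing single terms and still matching phrases concrete, here is a toy positional index in plain Java. It is only a sketch of the general idea (terms stored with their positions, a phrase matching when its terms sit at consecutive positions); it is not how Lucene implements phrase queries, and the names PhraseMatchSketch, index and matchesPhrase are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: not Lucene's data structures.
class PhraseMatchSketch {
    /** Maps each term to the list of positions at which it occurs in the field. */
    static Map<String, List<Integer>> index(List<String> terms) {
        Map<String, List<Integer>> positions = new HashMap<>();
        for (int i = 0; i < terms.size(); i++) {
            positions.computeIfAbsent(terms.get(i), k -> new ArrayList<>()).add(i);
        }
        return positions;
    }

    /** True if the phrase terms occur at consecutive positions, in order. */
    static boolean matchesPhrase(Map<String, List<Integer>> positions, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        List<Integer> starts = positions.get(phrase.get(0));
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean ok = true;
            // The k-th phrase term must sit exactly k positions after the first.
            for (int k = 1; k < phrase.size() && ok; k++) {
                List<Integer> p = positions.get(phrase.get(k));
                ok = p != null && p.contains(start + k);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```

With the five terms from the message indexed in order, the phrase (erick, erickson) matches in this model while (erick, thinks) does not, even though both terms are present in the field.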

Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-18 Thread Varun Dhussa
The method suggested would speed things up, but I doubt whether the gain would be substantial on processors with a slower clock speed. Keeping in mind that most processors are going multi-core, it would make sense to multi-thread the scan. Any remarks are welcome! Varun Dhussa Product Architect

RE: Phrase indexing and searching with Lucene

2009-02-18 Thread Nada Mimouni
Thank you Erick. I need first to index phrases, the built-in phrase processing (with double quotes) comes in the search step. Is there any difference between : 1) start by indexing phrases and then make a phrase search 2) index terms and then search for phrases To

Re: Reporting indexed metadata

2009-02-18 Thread Erick Erickson
I don't understand your question. Metadata about what? The Fields in the document? The number of terms in a field? The most frequent word in the index? in the Document? If you elucidated the problem you're trying to solve you'd probably get better answers Best Erick On Wed, Feb 18, 2009 at 7

Re: Phrase indexing and searching with Lucene

2009-02-18 Thread Erick Erickson
Have you tried the built-in phrase processing with double quotes? e.g. "this is a phrase"? See the Term section at http://lucene.apache.org/java/2_4_0/queryparsersyntax.html Best Erick On Wed, Feb 18, 2009 at 5:57 AM, Nada Mimouni < mimo...@tk.informatik.tu-darmstadt.de> wrote: > > > Hello ever

Reporting indexed metadata

2009-02-18 Thread Tod
Is there a way I can ask lucene what metadata elements it knows of and is storing in its index? Thanks - Tod

Re: the efficiency of creating indexes

2009-02-18 Thread Michael McCandless
If not for merging, I believe indexing is simply linear. Merging adds only a logarithmic (in total index size) cost. Using as large an IndexWriter RAM buffer as you can will minimize the amount of merging. (Also increasing mergeFactor, or decreasing maxMergeMB/Docs, but these will impact s
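The claim that merging adds a logarithmic cost, and that a larger RAM buffer reduces it, can be checked with a small simulation. The model below is an assumption made for illustration, not Lucene's actual merge policy: every flush writes one segment of docsPerFlush documents, and whenever mergeFactor segments pile up at one level they are merged into a single segment at the next level, copying every document in them once. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified level-based merge model; Lucene's real merge policy differs in detail.
class MergeCostSketch {
    /** Returns the total number of documents rewritten by merges under this model. */
    static long docsCopiedByMerges(int totalDocs, int docsPerFlush, int mergeFactor) {
        // levels.get(i) holds the sizes of the segments currently at level i.
        List<List<Integer>> levels = new ArrayList<>();
        long copied = 0;
        for (int indexed = 0; indexed < totalDocs; indexed += docsPerFlush) {
            int size = Math.min(docsPerFlush, totalDocs - indexed);
            int level = 0;
            while (true) {
                if (levels.size() <= level) {
                    levels.add(new ArrayList<>());
                }
                List<Integer> segments = levels.get(level);
                segments.add(size);
                if (segments.size() < mergeFactor) {
                    break; // not enough segments at this level to trigger a merge
                }
                int merged = 0;
                for (int s : segments) {
                    merged += s;
                }
                copied += merged; // every document in the merged segments is rewritten
                segments.clear();
                size = merged;    // the merged segment moves up one level
                level++;
            }
        }
        return copied;
    }
}
```

Under this model, indexing 1000 documents with mergeFactor 10 rewrites 3000 documents when every document is flushed on its own, and 2000 when ten documents are buffered per flush: each document is copied once per merge level, and a larger buffer removes a level.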

Phrase indexing and searching with Lucene

2009-02-18 Thread Nada Mimouni
Hello everybody, I use Lucene to index and search into text documents. At present, I just index and search for single words. I want to extend this to phrases (or nGrams). Could anyone please give me details on how to index phrases and then make a phrase search? Thank you very much in advanc

Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-18 Thread mark harwood
I was having some thoughts recently about speeding up fuzzy search. The current system does edit-distance on all terms A-Z, single threaded. Prefix length can reduce the search space and there is a "minimum similarity" threshold but that's roughly where we are. Multithreading this to make use o
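A rough sketch of the scan described above, in plain Java: every term is compared to the query by edit distance, a required common prefix prunes the candidates, and a minimum-similarity threshold filters the rest. The similarity formula used here (1 - distance / length of the shorter string) is a simplification assumed for the example and may not match Lucene's exact formula; the class and method names are hypothetical. Because each term is checked independently, the same scan can be spread over several threads, shown here with a parallel stream.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: not Lucene's FuzzyQuery implementation.
class FuzzyScanSketch {
    /** Classic Levenshtein distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Prefix check first (cheap), edit distance only for surviving terms. */
    static boolean matches(String term, String query, String prefix, double minSimilarity) {
        if (!term.startsWith(prefix)) {
            return false;
        }
        int shorter = Math.min(term.length(), query.length());
        int d = editDistance(term, query);
        // Assumed similarity measure; two empty strings count as identical.
        double similarity = shorter == 0 ? (d == 0 ? 1.0 : 0.0) : 1.0 - (double) d / shorter;
        return similarity >= minSimilarity;
    }

    /** Single-threaded scan over all terms. */
    static List<String> fuzzyMatches(List<String> terms, String query,
                                     int prefixLength, double minSimilarity) {
        String prefix = query.substring(0, Math.min(prefixLength, query.length()));
        return terms.stream()
                .filter(t -> matches(t, query, prefix, minSimilarity))
                .collect(Collectors.toList());
    }

    /** Same scan, with the per-term work split across the common fork-join pool. */
    static List<String> fuzzyMatchesParallel(List<String> terms, String query,
                                             int prefixLength, double minSimilarity) {
        String prefix = query.substring(0, Math.min(prefixLength, query.length()));
        return terms.parallelStream()
                .filter(t -> matches(t, query, prefix, minSimilarity))
                .collect(Collectors.toList());
    }
}
```

Whether the parallel variant helps on a machine with many slow cores depends on how large the term list is and how much of it the prefix prunes; the sketch makes no claim about actual speedups.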

Lucene search performance on Sun UltraSparc T2 (T5120) servers

2009-02-18 Thread Varun Dhussa
Hi, I have had a bad experience when migrating my application from Intel Xeon based servers to Sun UltraSparc T2 T5120 servers. Lucene fuzzy search just does not perform. A search which took approximately 500 ms takes more than 6 seconds to execute. The index has about 100,000,000 records. S

Phrase indexing and searching with Lucene

2009-02-18 Thread Nada Mimouni
Hello everybody, In my research work, I use Lucene to index and search into text documents. At present, I just index and search for single words. I want to extend this to phrases (or nGrams). Could anyone please give me more details on how to do it and also point me to some useful references o

the impact of thousands of field in a single document

2009-02-18 Thread Fang_Li
Hi, Due to a requirement, we need to construct a Lucene document with tens of thousands of Fields. Did anyone try this? What's the performance penalty compared with one single field to store all tokens, for both indexing and searching? Thanks, Li

RE: the efficiency of creating indexes

2009-02-18 Thread Fang_Li
Did you try? The cost of index merging grows as indexes get bigger. Try limiting the maximum number of documents in a segment by calling setMaxMergeDocs on IndexWriter. -Original Message- From: 治江 王 [mailto:wangzhijiang...@yahoo.com.cn] Sent: Monday, February 16, 2009 1:49 PM To: java-us