Re: What is the best way to split substring words

2007-05-20 Thread Soeren Pekrul
bhecht wrote: I want to be able to split tokens by giving a list of substring words. So I can give a list of subwords like: "strasse", "gasse", and the token "mainstrasse" or "maingasse" will be split into 2 tokens, "main" and "strasse". IMBEMBA, PASQUALINO: A Splitter for German Compound Words. F
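
The splitting described in this excerpt can be sketched in plain Java. This is only a minimal illustration, not the approach from the cited paper or a Lucene API: the class `CompoundSplitter` and its `split` method are hypothetical names, and the sketch assumes a subword is only recognised as a suffix of the token, with the longest matching subword winning.

```java
import java.util.ArrayList;
import java.util.List;

final class CompoundSplitter {
    // Hypothetical helper: splits a token in two when it ends with one of the
    // given subwords (case-insensitive). Assumption: suffix matching only.
    static List<String> split(String token, List<String> subwords) {
        String best = null;
        for (String sub : subwords) {
            int start = token.length() - sub.length();
            // start > 0 requires a non-empty prefix, so "strasse" alone is not split.
            boolean matches = start > 0
                    && token.regionMatches(true, start, sub, 0, sub.length());
            if (matches && (best == null || sub.length() > best.length())) {
                best = sub; // prefer the longest matching subword
            }
        }
        List<String> parts = new ArrayList<>();
        if (best == null) {
            parts.add(token); // no subword matched, keep the token whole
        } else {
            int cut = token.length() - best.length();
            parts.add(token.substring(0, cut));
            parts.add(token.substring(cut));
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> subwords = List.of("strasse", "gasse");
        System.out.println(split("mainstrasse", subwords)); // [main, strasse]
        System.out.println(split("maingasse", subwords));   // [main, gasse]
        System.out.println(split("lucene", subwords));      // [lucene]
    }
}
```

In an actual Lucene setup this logic would presumably live inside a custom token filter; that wiring is left out here.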

Re: One (large) field shared by many documents

2007-05-20 Thread Paul Elschot
On Sunday 20 May 2007 02:49, Peter Bloem wrote: > Ah, now we're getting somewhere. So I run the first query on the > collection index, get a set of collection id's from that. But how do I > use them in the second query on the document index? It should be easy > enough to retrieve all documents i

Command line search tool using Lucene (targeted for Site Search)

2007-05-20 Thread Saurabh Dani
Greetings All, We would like to introduce our Java, Lucene-based command-line search tool, Minalyzer Lite. Minalyzer Lite ships with an indexing executable, which can index data from the file system, from databases, by crawling web sites, or from an ARC file (output of the Heritrix crawler). The end user does

Re: One (large) field shared by many documents

2007-05-20 Thread Erick Erickson
See Paul's e-mail, he's talking about a place I haven't been in Lucene yet. Other than that, see below On 5/19/07, Peter Bloem <[EMAIL PROTECTED]> wrote: Ah, now we're getting somewhere. So I run the first query on the collection index, get a set of collection id's from that. But how do I

Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-20 Thread Andreas Guther
Thank you for the clarification. I assume that hits are usually returned in ranking order, which makes sense in terms of how one usually wants to display the result. In terms of access speed this is the unwanted order. Though sorting the array is not a big deal, it might be interesting thinking about
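
The sorting step mentioned in this excerpt can be sketched as follows. This is a plain-Java illustration under the assumption that the hits are available as an array of integer document ids in ranking order; the class `IndexOrderFetch` and the method `inIndexOrder` are hypothetical names, not part of Lucene.

```java
import java.util.Arrays;

final class IndexOrderFetch {
    // Returns a copy of the ranked document ids sorted ascending, i.e. in the
    // order in which the documents are assumed to be laid out in the index.
    // The input array keeps its ranking order for later display.
    static int[] inIndexOrder(int[] rankedDocIds) {
        int[] sorted = rankedDocIds.clone();
        Arrays.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        int[] ranked = {42, 7, 19}; // best hit first (illustrative ids)
        System.out.println(Arrays.toString(inIndexOrder(ranked))); // [7, 19, 42]
        System.out.println(Arrays.toString(ranked));               // [42, 7, 19]
    }
}
```

Stored documents would then be read in the sorted order and shown in the original ranking order.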

Re: One (large) field shared by many documents

2007-05-20 Thread Peter Bloem
Thanks for your reply. This is getting me much deeper into the uncharted territories of Lucene, especially the area of FieldCaches, but it's also piqued my curiosity. Most of what I've been able to find are discussions by people that are already using FieldCache, rather than explanations of wha

Re: One (large) field shared by many documents

2007-05-20 Thread Peter Bloem
My comments on storing document id's are perhaps based on a misguided view of lucene, but it's worth investigating. I figured since there's only one document per id in the document index, instead of executing one query with n OR clauses, you could execute n queries with a single docId to get al

Re: One (large) field shared by many documents

2007-05-20 Thread Paul Elschot
On Sunday 20 May 2007 19:52, Peter Bloem wrote: > Thanks for your reply. This is getting me much deeper into the uncharted > territories of Lucene, especially the area of FieldCaches, but it's also > piqued my curiosity. Most of what I've been able to find are discussions > by people that are al

Re: Memory leak (JVM 1.6 only)

2007-05-20 Thread Stephen Gray
Thanks, the link was helpful. I'll let you know if I find anything. Thanks for all the replies to this. Steve Doron Cohen wrote: Stephen Gray wrote: Thanks. If the extra memory allocated is native memory I don't think jconsole includes it in "non-heap" as it doesn't show this as increasin

Optional terms in BooleanQuery

2007-05-20 Thread Peter Bloem
I'm constructing a search with some required terms and some optional terms in the query. According to some earlier posts that looks like "+(A B) C D E" in query syntax for required terms A and B and optional terms C, D and E. In other words, Lucene considers all documents that have both A and

Re: Optional terms in BooleanQuery

2007-05-20 Thread Mark Miller
I like to think of it like this: Each doc is going to get a score -- if the score is positive the doc will be a hit, if the score is 0 the doc will not be a hit. If a boolean clause is Occur.Must and it is not found, the score will be dropped to 0 no matter what (if found, the score is obviou
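
The mental model in this excerpt can be written down as a toy scoring function. This is a deliberately simplified sketch, not Lucene's actual scoring: the class `BooleanScoreSketch`, the nested `Clause` type, and the "one point per matched clause" rule are all assumptions made for illustration. The example query uses two required clauses and three optional ones, written as separate required clauses for clarity.

```java
import java.util.List;
import java.util.Set;

final class BooleanScoreSketch {
    // Hypothetical clause type: a term plus a flag saying whether it is required.
    static final class Clause {
        final String term;
        final boolean must;
        Clause(String term, boolean must) {
            this.term = term;
            this.must = must;
        }
    }

    // Toy score: 0 if any required clause is missing, otherwise one point per
    // matched clause. A document counts as a hit when the score is positive.
    static int score(Set<String> docTerms, List<Clause> clauses) {
        int score = 0;
        for (Clause c : clauses) {
            boolean found = docTerms.contains(c.term);
            if (c.must && !found) {
                return 0; // a missing required clause drops the score to 0
            }
            if (found) {
                score++;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        List<Clause> query = List.of(
                new Clause("A", true), new Clause("B", true),
                new Clause("C", false), new Clause("D", false), new Clause("E", false));
        System.out.println(score(Set.of("A", "B", "C"), query)); // 3
        System.out.println(score(Set.of("A", "B"), query));      // 2
        System.out.println(score(Set.of("A", "C", "D"), query)); // 0
    }
}
```

Under this model the optional terms never decide whether a document is a hit; they only raise the score of documents that already satisfy every required clause.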

Re: Field.Store.Compress - does it improve performance of document reads?

2007-05-20 Thread Erick Erickson
Have you tried the static Sort.INDEXORDER sort object in Lucene 2.1? Erick On 5/20/07, Andreas Guther <[EMAIL PROTECTED]> wrote: Thank you for the clarification. I assume that hits are usually returned in ranking order, which makes sense in terms of how one usually wants to display the result. In term

Re: Command line search tool using Lucene (targeted for Site Search)

2007-05-20 Thread James liu
It does not seem quick. http://demo1.minalyzer.com/minalyzerlite/search4.php?q=test&offset=0 Results 1 - 15 of 16 for test. (1.586 seconds) 2007/5/20, Saurabh Dani <[EMAIL PROTECTED]>: Greetings All, We would like to introduce our Java, Lucene-based command-line search tool, Minalyzer Lite.