[ANNOUNCE] Apache Lucene 4.2 released

2013-03-11 Thread Robert Muir
March 2013, Apache Lucene™ 4.2 available. The Lucene PMC is pleased to announce the release of Apache Lucene 4.2. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text sea

SpanQuery getSpan call in Lucene 4

2013-03-11 Thread ash nix
Hi, I was following the tutorial at http://searchhub.org/2009/05/26/accessing-words-around-a-positional-match-in-lucene/ for counting the number of spans of a query in a document. But the definition of getSpan(IndexReader) in SpanQuery has changed to getSpan(IndexReaderContext, Bits, Map) with no inform

Re: Rewrite for RegexpQuery

2013-03-11 Thread Michael Sokolov
On 03/11/2013 01:22 PM, Michael McCandless wrote: On Mon, Mar 11, 2013 at 9:32 AM, Carsten Schnober wrote: On 11.03.2013 at 13:38, Michael McCandless wrote: On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler wrote: Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE,

RE: Rewrite for RegexpQuery

2013-03-11 Thread Uwe Schindler
If you are interested, here is the solution with the "fake" query as rewrite. Just use GetTermsRewrite as rewrite method. The MTQ then rewrites to TermHolderQuery (cast to that) and you can get the terms using getTerms(): /** A fake query that is just used to collect all term instances for the

RE: Rewrite for RegexpQuery

2013-03-11 Thread Uwe Schindler
I think we have different problems here: Carsten just wants to collect the terms an MTQ visits, so using BooleanQuery to do this is fine, unless you hit the limit. If you don’t execute the query, the limit can be as high as possible (but it’s a static limit affecting all instances). To do the sa

Re: Rewrite for RegexpQuery

2013-03-11 Thread Michael McCandless
On Mon, Mar 11, 2013 at 9:32 AM, Carsten Schnober wrote: > On 11.03.2013 at 13:38, Michael McCandless wrote: >> On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler wrote: >> >>> Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE, then this >>> should work (after rewrite your query is a Boole

Re: Should heap size be proportionate to the size of the index I'm opening?

2013-03-11 Thread Gili Nachum
Great links. Thanks, Ian. Good to know that Lucene v4 has a smaller heap footprint. On Mon, Mar 11, 2013 at 11:18 AM, Ian Lea wrote: > It's not that simple. More to do with number of terms than raw index > size. Of course your large index may well have more terms than a > smaller one. > > S

Re: Rewrite for RegexpQuery

2013-03-11 Thread Carsten Schnober
On 11.03.2013 at 13:38, Michael McCandless wrote: > On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler wrote: > >> Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE, then this >> should work (after rewrite your query is a BooleanQuery, which supports >> extractTerms()). > > ... as long a

Re: Rewrite for RegexpQuery

2013-03-11 Thread Carsten Schnober
On 11.03.2013 at 14:13, Uwe Schindler wrote: >> Regarding the application of IndexSearcher.rewrite(Query) instead: I don't >> see a way to set the rewrite method there because the Query's rewrite >> method does not seem to apply to IndexSearcher.rewrite(). > > Replace: >> BooleanQuery bq = (Boolea

RE: Rewrite for RegexpQuery

2013-03-11 Thread Uwe Schindler
> Set terms = new HashSet<>(); > MultiTermQuery query = new RegexpQuery(new Term("text", query)); > query.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE); > BooleanQuery bq = (BooleanQuery) query.rewrite(reader); > bq.extractTerms(terms); > > > Regarding the application of Index

Re: Rewrite for RegexpQuery

2013-03-11 Thread Michael McCandless
On Mon, Mar 11, 2013 at 7:08 AM, Uwe Schindler wrote: > Set the rewrite method to e.g. SCORING_BOOLEAN_QUERY_REWRITE, then this > should work (after rewrite your query is a BooleanQuery, which supports > extractTerms()). ... as long as you don't exceed the max number of terms allowed by BQ (10

Re: AutoSuggest with Query-Filters

2013-03-11 Thread Michael McCandless
On Mon, Mar 11, 2013 at 7:33 AM, Nils Knappmeier wrote: > Hi, > >> This is tricky. >> >> You could build a separate suggester per category/zip code (or, >> possibly prefix-code each suggestion with the category/zip code into >> one suggester), but likely this will blow up (ie, if the same >> sugge

Re: Rewrite for RegexpQuery

2013-03-11 Thread Carsten Schnober
On 11.03.2013 at 12:08, Uwe Schindler wrote: > This works for this query, but in general you have to rewrite until it is > completely rewritten: A while loop that exits when the result of the rewrite > is identical to the original query. IndexSearcher.rewrite() does this for > you. > >> 3. Wri
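
The "rewrite until completely rewritten" loop mentioned in this message can be sketched without any Lucene dependency. This is only an illustration under stated assumptions: RewriteLoop, rewriteFully, and the rewriteStep parameter are hypothetical names, and the generic operator stands in for a single Query.rewrite(reader) call. It is not the IndexSearcher source.

```java
import java.util.function.UnaryOperator;

/** Hypothetical helper illustrating a rewrite-to-fixed-point loop. */
public class RewriteLoop {

    /**
     * Applies rewriteStep repeatedly until it returns the very same
     * instance it was given, which signals that nothing is left to rewrite.
     * rewriteStep is a stand-in for one Query.rewrite(reader) call.
     */
    public static <Q> Q rewriteFully(Q original, UnaryOperator<Q> rewriteStep) {
        Q current = original;
        Q next = rewriteStep.apply(current);
        // Identity comparison: a step that changes nothing must return its argument.
        while (next != current) {
            current = next;
            next = rewriteStep.apply(current);
        }
        return current;
    }
}
```

With real Lucene queries, calling IndexSearcher.rewrite(), as the message says, avoids writing this loop by hand.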

Re: AutoSuggest with Query-Filters

2013-03-11 Thread Nils Knappmeier
Hi, This is tricky. You could build a separate suggester per category/zip code (or, possibly prefix-code each suggestion with the category/zip code into one suggester), but likely this will blow up (ie, if the same suggestion often appears across zip codes / categories). If your suggestions are

Re: Migrate/Upgrade from Lucene 2.3

2013-03-11 Thread Ramprakash Ramamoorthy
On Mon, Mar 11, 2013 at 3:41 PM, Uwe Schindler wrote: > In that case, it should be fine. Otherwise you would need to reindex. > > Thank you Uwe. > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > > -Original Message- >

RE: Rewrite for RegexpQuery

2013-03-11 Thread Uwe Schindler
Hi, > Hi, > I'm trying to get the terms that match a certain RegexpQuery. My (naive) > approach: > > 1. Create a RegexpQuery from the queryString (e.g. "abc.*"): > Query q = new RegexpQuery(new Term("text", queryString)); > > 2. Rewrite the Query using the IndexReader reader: > q = q.rewrite(rea

Re: AutoSuggest with Query-Filters

2013-03-11 Thread Michael McCandless
On Mon, Mar 11, 2013 at 6:31 AM, Nils Knappmeier wrote: > Dear all, > > I have a request to implement an auto-suggest feature for our Lucene-based > product. > We have upgraded to Lucene 4.1 and intend to use the AnalyzingSuggester, but > we cannot determine the correct way of using it for our req

Re: Rewrite for RegexpQuery

2013-03-11 Thread Michael McCandless
You could call .getTermsEnum() on the query itself, and then step through the terms and save them? But this method is protected ... so you could make a subclass w/ a new method that calls it and returns it to you. Mike McCandless http://blog.mikemccandless.com On Mon, Mar 11, 2013 at 6:41 A

Rewrite for RegexpQuery

2013-03-11 Thread Carsten Schnober
Hi, I'm trying to get the terms that match a certain RegexpQuery. My (naive) approach: 1. Create a RegexpQuery from the queryString (e.g. "abc.*"): Query q = new RegexpQuery(new Term("text", queryString)); 2. Rewrite the Query using the IndexReader reader: q = q.rewrite(reader); 3. Write the ter

AutoSuggest with Query-Filters

2013-03-11 Thread Nils Knappmeier
Dear all, I have a request to implement an auto-suggest feature for our Lucene-based product. We have upgraded to Lucene 4.1 and intend to use the AnalyzingSuggester, but we cannot determine the correct way of using it for our request. We have problems with two aspects: 1) The suggester shou

RE: Migrate/Upgrade from Lucene 2.3

2013-03-11 Thread Uwe Schindler
In that case, it should be fine. Otherwise you would need to reindex. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Uwe Schindler [mailto:u...@thetaphi.de] > Sent: Monday, March 11, 2013 8:42 AM > To:

Re: Migrate/Upgrade from Lucene 2.3

2013-03-11 Thread Ramprakash Ramamoorthy
On Mon, Mar 11, 2013 at 1:11 PM, Uwe Schindler wrote: > If you use StandardAnalyzer, you are in trouble unless you use > StandardAnalyzer with Version.LUCENE_23 and you are using non-western > language. If you change your code to use Version.LUCENE_41, you have to > reindex. > > Thank you Uwe. We

Re: Should heap size be proportionate to the size of the index I'm opening?

2013-03-11 Thread Ian Lea
It's not that simple. More to do with number of terms than raw index size. Of course your large index may well have more terms than a smaller one. See http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html and http://searchhub.org/2011/09/14/estimating-memory-and-storage-

Re: Lightweight detection of whether a keyword is CJK or not (language detection)

2013-03-11 Thread Gili Nachum
This character lies in the CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A block. Added extensions detection, I assume (not really knowing) that all of these characters are not phonetic as well. import java.lang.Character.UnicodeBlock; import java.util.Arrays; import java.util.HashSet; import java.util.Set; i
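
The poster's code is cut off in this preview, so the following is not a reconstruction of it. It is a minimal sketch of the block-based check the message describes, assuming that "CJK" here means the ideograph blocks only; the class name CjkDetector, the method containsCjkIdeograph, and the exact set of blocks are illustrative choices.

```java
import java.lang.Character.UnicodeBlock;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: checks whether a keyword contains a CJK ideograph. */
public class CjkDetector {

    // Assumed block list: the base ideograph block plus two extensions
    // and the compatibility ideographs. Phonetic scripts such as
    // Hiragana or Hangul are deliberately left out.
    private static final Set<UnicodeBlock> CJK_BLOCKS = new HashSet<>(Arrays.asList(
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B,
            UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS));

    /** Returns true if any code point of the keyword lies in one of the blocks above. */
    public static boolean containsCjkIdeograph(String keyword) {
        for (int i = 0; i < keyword.length(); ) {
            // Iterate by code point so supplementary characters (Extension B) work.
            int codePoint = keyword.codePointAt(i);
            if (CJK_BLOCKS.contains(UnicodeBlock.of(codePoint))) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }
}
```

Because the check is a set lookup per code point, it stays lightweight; which blocks belong in the set depends on the application.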

RE: Migrate/Upgrade from Lucene 2.3

2013-03-11 Thread Uwe Schindler
If you use StandardAnalyzer, you are in trouble unless you use StandardAnalyzer with Version.LUCENE_23 and you are using non-western language. If you change your code to use Version.LUCENE_41, you have to reindex. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eM