Re: how to estimate how much memory is required to support the large index search

2008-11-18 Thread Michael McCandless
BTW, upcoming changes in Lucene for flexible indexing should improve the RAM usage of the terms index substantially: https://issues.apache.org/jira/browse/LUCENE-1458 In the current (first) iteration on that patch, TermInfo is no longer used at all when loading the index. I think for

Re: Lucene 2.4 Token Stream error

2008-11-18 Thread Michael McCandless
Can you post the code fragment in AccentFilter.java that's setting the Token? In 2.4 we added that check (for IllegalArgumentException) to ensure you don't setTermLength to something longer than the current term buffer. You should call resizeTermBuffer() first, then fill in the char[]
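The ordering described in this reply (resize first, then fill the char[], then set the length) can be sketched without Lucene on the classpath. The class below is a hypothetical stand-in, not Lucene's Token: it borrows only the two method names mentioned in the message, and its initial buffer size, growth policy, and the `of`/`term` helpers are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical stand-in for a token with a reusable char[] term buffer.
// It is NOT Lucene's Token class; it only models the ordering contract.
final class TermBufferSketch {
    private char[] buffer = new char[4]; // deliberately small initial buffer (assumption)
    private int length = 0;

    // Grow the buffer to hold at least newSize chars, keeping existing content.
    char[] resizeTermBuffer(int newSize) {
        if (buffer.length < newSize) {
            buffer = Arrays.copyOf(buffer, newSize);
        }
        return buffer;
    }

    // Reject a length longer than the current buffer, mirroring the check
    // the message describes for 2.4.
    void setTermLength(int newLength) {
        if (newLength > buffer.length) {
            throw new IllegalArgumentException(
                "length " + newLength + " exceeds buffer size " + buffer.length);
        }
        length = newLength;
    }

    // Current term text, for inspection.
    String term() {
        return new String(buffer, 0, length);
    }

    // Safe order: resize first, then fill the char[], then set the length.
    static TermBufferSketch of(String text) {
        TermBufferSketch token = new TermBufferSketch();
        char[] buf = token.resizeTermBuffer(text.length());
        text.getChars(0, text.length(), buf, 0);
        token.setTermLength(text.length());
        return token;
    }
}
```

Calling `setTermLength` before the buffer has been grown is the case that fails in this sketch, which is presumably the situation the filter in question runs into.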

Reopen IndexReader

2008-11-18 Thread Ganesh
Hello all, I am using version 2.4. The following code throws AlreadyClosedException: IndexReader reader = searcher.getIndexReader(); IndexReader newReader = reader.reopen(); if (reader != newReader) { reader.close(); boolean isCurrent = newReader.isCurr

Re: Reopen IndexReader

2008-11-18 Thread Michael McCandless
Did you create your IndexSearcher using a String or File (not Directory)? If so, it sounds like you are hitting this issue (just fixed this morning, on 2.9-dev (trunk)): https://issues.apache.org/jira/browse/LUCENE-1453 The workaround is to use the Directory ctor of IndexSearcher. M

Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

2008-11-18 Thread Sascha Fahl
Hi, what is the best way to transform the German umlauts ö,ä,ü,ß into oe, ae, ue, ss during the process of analyzing? Thanks, Sascha Fahl Softwareentwicklung evenity GmbH Zu den Mühlen 19 D-35390 Gießen Mail: [EMAIL PROTECTED] --

Re: Reopen IndexReader

2008-11-18 Thread Ganesh
I am creating the IndexSearcher using a String, which works fine with version 2.3.2. I tried switching to the Directory ctor of IndexSearcher and it works fine with v2.4. I have recently upgraded from v2.3.2 to 2.4. Is v2.4 stable enough that I can move forward with it, or shall I revert back to 2.3

AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

2008-11-18 Thread Uwe Goetzke
Use ISOLatin1AccentFilter, although it is not perfect... So I made ISOLatin2AccentFilter for me and changed this method. We use our own analysers, so you would use something like this result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader); result = new
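For readers who only need the character mapping asked about in this thread, here is a minimal plain-Java sketch. It is not the code of ISOLatin1AccentFilter or of the ISOLatin2AccentFilter mentioned above, and it does not use any Lucene API. The class name `GermanUmlautFolder` is hypothetical, and the uppercase mappings (Ä to Ae and so on) are an assumption, since the thread only lists the lowercase forms.

```java
// Hypothetical helper: folds German umlauts and sharp s into ASCII digraphs.
// Unicode escapes are used so the source compiles regardless of file encoding.
final class GermanUmlautFolder {
    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\u00E4': out.append("ae"); break; // a-umlaut
                case '\u00F6': out.append("oe"); break; // o-umlaut
                case '\u00FC': out.append("ue"); break; // u-umlaut
                case '\u00DF': out.append("ss"); break; // sharp s
                case '\u00C4': out.append("Ae"); break; // uppercase forms: assumption
                case '\u00D6': out.append("Oe"); break;
                case '\u00DC': out.append("Ue"); break;
                default: out.append(c);                 // everything else unchanged
            }
        }
        return out.toString();
    }
}
```

In an analysis chain this mapping would be applied to each token's text; wiring it into a custom filter, as the message describes doing, is not shown here.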

Re: Reopen IndexReader

2008-11-18 Thread Michael McCandless
Well... we certainly do our best to have each release be stable, but we do make mistakes, so you'll have to use your own judgement on when to upgrade. However, it's only through users like yourself upgrading that we then find & fix any uncaught issues in each new release. Mike Ganesh w

Re: AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

2008-11-18 Thread Koji Sekiguchi
Uwe Goetzke wrote: > Use ISOLatin1AccentFilter, although it is not perfect... > So I made ISOLatin2AccentFilter for me and changed this method. Or use CharFilter library. It is for Solr as of now, though. See: https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG

Re: AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

2008-11-18 Thread Sascha Fahl
Where do I get the CharFilter library? I'm using Lucene, not Solr. Thanks, Sascha On 18.11.2008 at 14:11, Koji Sekiguchi wrote: Uwe Goetzke wrote: > Use ISOLatin1AccentFilter, although it is not perfect... > So I made ISOLatin2AccentFilter for me and changed this method. Or use CharFilter li

Re: how to estimate how much memory is required to support the large index search

2008-11-18 Thread Zhibin Mai
You are right. Cheers, Zhibin From: Chris Lu <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Monday, November 17, 2008 11:13:44 PM Subject: Re: how to estimate how much memory is required to support the large index search So looks like you are not

Re: AW: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

2008-11-18 Thread Koji Sekiguchi
Sascha Fahl wrote: Where do I get the CharFilter library? I'm using Lucene, not Solr. Thanks, Sascha CharFilter is included in recent Solr nightly builds. It is not an out-of-the-box solution for Lucene yet, sorry. If I have time, I will port it to Lucene this weekend. Koji --

RE: Transforming german umlaute like ö,ä,ü,ß into oe, ae, ue, ss

2008-11-18 Thread Teruhiko Kurosaka
Naming this class to include "Latin2" may be misleading. Latin2 means the ISO-8859-2 character set. http://en.wikipedia.org/wiki/ISO_8859-2 > From: Uwe Goetzke [mailto:[EMAIL PROTECTED] > Sent: Tuesday, November 18, 2008 7:26 AM > To: java-user@lucene.apache.org > Cc: [EMAIL PROTECTED] > Subject: A

Special characters prevent entity being indexed

2008-11-18 Thread Pekka Nykyri
Hi! I'm having problems with entities including special characters (Spanish language) not getting indexed. I haven't been able to find the reason why some entities get indexed while some don't. I have 3 fields that (currently) hold the same value. The value for the fields is, for example, "¡

Re: Special characters prevent entity being indexed

2008-11-18 Thread Erick Erickson
What analyzer are you using at index and search time? Typical problems include: using an analyzer that doesn't understand accented chars (StandardAnalyzer for instance), or using a different analyzer during search and index. Search the user list for "accent" and you'll find this kind of problem discuss

compare scores across queries

2008-11-18 Thread Ng Vinny
Hi all, I am wondering if the raw scores obtained from HitCollector can be used to compare relevance of documents to different queries? E.g. two phrase queries are issued : (PQ1: "Barack Obama" and PQ2: "John McCain"). if a document (doc1) belongs to the result sets of both queries and has th

can I set Boost to the term while indexing?

2008-11-18 Thread T. H. Lin
I would like to store a set of keywords in a single field of a document. for example I have now three keywords: "One", "Two" and "Three" and I am going to add them into a document. At first, is this code correct? // String[] keyword

Searching repeating fields

2008-11-18 Thread Mark Ferguson
Hello, I am designing an index in which one url corresponds to one document. Each document also contains multiple parallel repeating fields. For example: Document 1: url: http://www.cnn.com/ page_description: cnn breaking news page_title: news page_title: cnn news page_title: homepage

Re: Searching repeating fields

2008-11-18 Thread Ian Lea
How about using variable field names? url: http://www.cnn.com/ page_description: cnn breaking news page_title_ajax: news page_title_paris: cnn news page_title_daniel: homepage username: ajax username: paris username: daniel and search for +user:ajax +page_title_ajax:news or maybe just pag
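The variable-field-name idea amounts to deriving the title field from the username at both index and query time. A minimal sketch of that derivation in plain Java follows. The class and method names are hypothetical, the query string is built by simple concatenation with no escaping of query-syntax characters, and the user clause uses `username`, the field name shown in the example document.

```java
// Hypothetical helper illustrating per-user field names such as page_title_ajax.
// Assumes usernames and terms are simple tokens that need no escaping.
final class VariableFieldNames {
    // Field that holds the page title entered by the given user.
    static String titleField(String username) {
        return "page_title_" + username;
    }

    // Query string requiring both the user and a term in that user's title field.
    static String query(String username, String titleTerm) {
        return "+username:" + username + " +" + titleField(username) + ":" + titleTerm;
    }
}
```

With this scheme `VariableFieldNames.query("ajax", "news")` yields `+username:ajax +page_title_ajax:news`. The cost is one distinct field per user, which is worth weighing if the number of users is large.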

Re: constructing a mini-index with just the number of hits for a term

2008-11-18 Thread Michael McCandless
Flexible indexing (LUCENE-1458) should make this possible. IE you could use your own codec which discards doc/freq/prox/payload during indexing (for this one field) and simply stores the term frequency in the terms dict. However, one problem will be deletions (in case it matters to yo

Re: Searching repeating fields

2008-11-18 Thread Mark Ferguson
Thanks for the suggestion, but I think I will need a more robust solution, because this will only work with pairs of fields. I should have specified that the example I gave was somewhat contrived, but in practice there could be more than two parallel fields. I'm trying to find a general solution th

Re: Searching repeating fields

2008-11-18 Thread Mark Ferguson
I'll provide a better example, perhaps it will help in formulating a solution. Suppose I am designing an index that stores invoices. One document corresponds to one invoice, which has a unique id. Any number of employees can make comments on the invoices, and comments have different classification

Re: Searching repeating fields

2008-11-18 Thread Chris Hostetter
There has been discussion in the past about how PhraseQuery artificially requires that the Terms you add to it must be in the same field ... you could theoretically modify PhraseQuery to have a type of query that required terms in one field be within (slop)N positions of a term in a "parallel"

Re: Term numbering and range filtering

2008-11-18 Thread Tim Sturge
I've finished a query time implementation of a column stride filter, which implements DocIdSetIterator. This just builds the filter at process start and uses it for each subsequent query. The index itself is unchanged. The results are very impressive. Here are the results on a 45M document index:

Re: Term numbering and range filtering

2008-11-18 Thread Paul Elschot
On Wednesday 19 November 2008 00:43:56, Tim Sturge wrote: > I've finished a query time implementation of a column stride filter, > which implements DocIdSetIterator. This just builds the filter at > process start and uses it for each subsequent query. The index itself > is unchanged. > > The resul

Re: Term numbering and range filtering

2008-11-18 Thread Tim Sturge
> With "Allow Filter as clause to BooleanQuery": > https://issues.apache.org/jira/browse/LUCENE-1345 > one could even skip the ConstantScoreQuery with this. > Unfortunately 1345 is unfinished for now. > That would be interesting; I'd like to see how much performance improves. >> startup: 2811

Re: InstantiatedIndex help + first impression

2008-11-18 Thread karl wettin
The actual performance depends on how much you load to the index. Can you tell us how many documents and how large these documents are that you have in your index? Compared with RAMDirectory I've seen performance boosts of up to 100x in a small index that contains (1-20) Wikipedia sized document

2.4 Performance

2008-11-18 Thread lucene
On an index of around 20 gigs I've been seeing a performance drop of around 35% after upgrading to 2.4 (measured on ~1 identical requests, executed in parallel against a threaded lucene / apache setup, after a roughly 1 query warmup). The principal changes I've made so far are just

Re: InstantiatedIndex help + first impression

2008-11-18 Thread karl wettin
On Wed, Nov 19, 2008 at 3:27 AM, karl wettin <[EMAIL PROTECTED]> wrote: > rewritten query. I.e. this is probably as much a store related expense > as it is a Levenshtein calculation expense. "this is probably *not* as much a store related.." that is. karl ---

Re: Reopen IndexReader

2008-11-18 Thread Cool The Breezer
I had the same kind of problem and managed to find a workaround by initializing the IndexSearcher from the new reader. try { IndexReader newReader = reader.reopen(); if (newReader != reader) { // reader was reopened

Reg two versions of lucene on the same machine

2008-11-18 Thread Shireesha.Katkoor
Hi, I am trying to upgrade the version of Lucene from 1.2 to 2.4. Can we do this directly? Is it possible to have two versions of Lucene on the same machine? Shireesha

Re: Reg two versions of lucene on the same machine

2008-11-18 Thread Anshum
Hi Shireesha, I'm not sure what it is that you have been using, but I'm fairly sure that you'd have to check for deprecated things as well as improved ones while upgrading. 1.2 to 2.4 is certainly a huge jump, with compound index structure etc. coming into place. You would have to try it and c