Re: SweetSpotSimilarity

2012-03-06 Thread Paul Taylor
On 05/03/2012 19:26, Chris Hostetter wrote: : very small to occasionally very large. It also might be the case that : cover letters and e-mails while short might not be really something to : heavily discount. The lower discount range can be ignored by setting : the min of any sweet spot to 1.

Is Java 7 now safe with Lucene?

2012-03-06 Thread Chris Bamford
Hi there, Is Java7 now safe to use with Lucene? If so, is there a minimum Lucene version I must use with it? Thanks, - Chris

Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
I've posted a self-contained test case to github of a mystery. git://github.com/bimargulies/lucene-4-update-case.git The code can be seen at https://github.com/bimargulies/lucene-4-update-case/blob/master/src/test/java/org/apache/lucene/BadFieldTokenizedFlagTest.java. I write a doc to an index,

RE: Is Java 7 now safe with Lucene?

2012-03-06 Thread Uwe Schindler
Hi, Any version of Lucene should be compatible with Java 7, if you use at least JDK7 update 1. There are some minor issues with older Lucene versions when *building* the package and *running tests*, but the precompiled binaries are fine. But you should use Lucene/Solr 3.5 as a minimum, as this one

A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
Under "LUCENE-1458, LUCENE-2111: Flexible Indexing", CHANGES.txt appears to be missing one critical hint. If you have existing code that called IndexReader.terms(), where do you start to get a FieldsEnum?

RE: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Uwe Schindler
AtomicReader.fields() - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Benson Margulies [mailto:bimargul...@gmail.com] > Sent: Tuesday, March 06, 2012 2:50 PM > To: java-user@lucene.apache.org > Subject:

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Michael McCandless
I think MIGRATE.txt talks about this? Mike McCandless http://blog.mikemccandless.com On Tue, Mar 6, 2012 at 8:50 AM, Benson Margulies wrote: > Under "LUCENE-1458, LUCENE-2111: Flexible Indexing", CHANGES.txt > appears to be missing one critical hint. If you have existing code > that called Inde

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 8:56 AM, Uwe Schindler wrote: > AtomicReader.fields() I went and read up AtomicReader in CHANGES.txt. Should I call SegmentReader.getReader(IOContext)? I just posted a patch to CHANGES.txt to clarify before I read your email, shall I improve it to use this instead of

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:09 AM, Michael McCandless wrote: > I think MIGRATE.txt talks about this? Yes it does, but it doesn't actually answer the specific question. See LUCENE-3853 where I added what seems to be missing. If it's somewhere else in the file I apologize. > > Mike McCandless > > htt

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
Oh, I see, I didn't read far enough down. Well, the patch still repairs a bug in the code fragment relative to the Term enumeration.

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Robert Muir
I think the issue is that your analyzer is StandardAnalyzer, yet the field text value is "value-1". So StandardAnalyzer will tokenize this into two terms: "value" and "1". But later, you proceed to do TermQueries on "value-1". This term won't exist... TermQuery etc. that take a Term don't analyze any text
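
The mismatch Robert describes can be sketched without Lucene. The following is a minimal illustration only: `tokenize` is a hypothetical stand-in that splits on non-alphanumeric characters and lowercases, not StandardAnalyzer itself, and `keyword` and `exactTermMatch` are hypothetical helpers that mimic an untokenized field and an exact term lookup.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class TermMismatchSketch {
    // Hypothetical stand-in for an analyzer: lowercase, then split on
    // anything that is not a letter or digit. NOT Lucene's StandardAnalyzer.
    static Set<String> tokenize(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Hypothetical stand-in for an untokenized field: the whole value is one term.
    static Set<String> keyword(String text) {
        return Collections.singleton(text);
    }

    // An exact term lookup applies no analysis to the term it is given.
    static boolean exactTermMatch(Set<String> indexedTerms, String term) {
        return indexedTerms.contains(term);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("value-1"));                            // [value, 1]
        System.out.println(exactTermMatch(tokenize("value-1"), "value-1")); // false
        System.out.println(exactTermMatch(keyword("value-1"), "value-1"));  // true
    }
}
```

If the field really were tokenized this way, an exact lookup for "value-1" would find nothing, while the same lookup against an untokenized field would match.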

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
Oh, ouch, there's no SegmentReader.getReader, I was reading IndexWriter. Sorry. On Tue, Mar 6, 2012 at 9:14 AM, Benson Margulies wrote: > On Tue, Mar 6, 2012 at 8:56 AM, Uwe Schindler wrote: >> AtomicReader.fields()

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir wrote: > I think the issue is that your analyzer is standardanalyzer, yet field > text value is "value-1" Robert, Why is this field analyzed at all? It's built with StringField.TYPE_STORED. I'll push another copy that shows that it works fine when the

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies wrote: > On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir wrote: >> I think the issue is that your analyzer is standardanalyzer, yet field >> text value is "value-1" > > Robert, > > Why is this field analyzed at all? It's built with StringField.TYPE_STO

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Robert Muir
On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies wrote: > On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir wrote: >> I think the issue is that your analyzer is standardanalyzer, yet field >> text value is "value-1" > > Robert, > > Why is this field analyzed at all? It's built with StringField.TYPE_STO

RE: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Uwe Schindler
Hi, MultiFields should only be used (as it is slow) if you know exactly what you are doing and what the consequences are. There is a change in Lucene 4.0, so you can no longer get terms and postings from a top-level (composite) reader. More info is also here: http://goo.gl/lMKTM Uwe - Uwe Sch

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:34 AM, Uwe Schindler wrote: > Hi, > > MultiFields should only be used (as it is slow) if you exactly know what you > are doing and what the consequences are. There is a change in Lucene 4.0, so > you can no longer terms and postings from a top-level (composite) reader.

Apply custom tokenization

2012-03-06 Thread Carsten Schnober
Dear list, I have a quite specific issue on which I would appreciate very much having some thoughts before I start the actual implementation. Here's my task description: I would like to index corpora that have already been tokenized by an external tokenizer. This tokenization is stored in an extern

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Michael McCandless
Hmm something is up here... I'll dig. Seems like we are somehow analyzing StringField when we shouldn't... Mike McCandless http://blog.mikemccandless.com On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir wrote: > On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies > wrote: >> On Tue, Mar 6, 2012 at 9

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir wrote: > On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies > wrote: >> On Tue, Mar 6, 2012 at 9:20 AM, Robert Muir wrote: >>> I think the issue is that your analyzer is standardanalyzer, yet field >>> text value is "value-1" >> >> Robert, >> >> Why is

RE: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Uwe Schindler
Hi, The recommended way to get an atomic reader from a composite reader is to use SlowCompositeReaderWrapper.wrap(reader). MultiFields is now purely internal. I think it's only public because the codecs package may need it, otherwise it should be pkg-private. - Uwe Schindler H.-H.-Meier-Al

RE: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Uwe Schindler
String field is analyzed, but with KeywordTokenizer, so all should be fine. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Michael McCandless [mailto:luc...@mikemccandless.com] > Sent: Tuesday, March 06

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:47 AM, Uwe Schindler wrote: > String field is analyzed, but with KeywordTokenizer, so all should be fine. I filed LUCENE-3854. > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > >> -Original Message

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 9:46 AM, Uwe Schindler wrote: > Hi, > > The recommended way to get an atomic reader from a composite reader is to use > SlowCompositeReaderWrapper.wrap(reader). MultiFields is now purely internal. > I think it's only public because the codecs package may need it, otherwise

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Robert Muir
Thanks Benson: looks like the problem revolves around indexing Document/Fields you get back from IR.document... this has always been 'lossy', but I think this is a real API trap. Please keep testing :) On Tue, Mar 6, 2012 at 9:58 AM, Benson Margulies wrote: > On Tue, Mar 6, 2012 at 9:47 AM, Uwe S

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Benson Margulies
On Tue, Mar 6, 2012 at 10:04 AM, Robert Muir wrote: > Thanks Benson: look like the problem revolves around indexing > Document/Fields you get back from IR.document... this has always been > 'lossy', but I think this is a real API trap. > > Please keep testing :) Got a suggestion for sneaking arou

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Michael McCandless
On Tue, Mar 6, 2012 at 10:06 AM, Benson Margulies wrote: > On Tue, Mar 6, 2012 at 10:04 AM, Robert Muir wrote: >> Thanks Benson: look like the problem revolves around indexing >> Document/Fields you get back from IR.document... this has always been >> 'lossy', but I think this is a real API trap.

Filter the search based on the subset of docids.

2012-03-06 Thread Kushal Dave
I have an ID field that contains about 100,000 unique ids. If I want to query all records with ids [1-100], How should I be doing this? I tried doing it the following way: Query qry = new MultiFieldQueryParser( fi

Re: Filter the search based on the subset of docids.

2012-03-06 Thread Ian Lea
You'll need to pad your ids to make this work. 01 02 etc. with a length to match the max you require, now or in the future. Or, better, upgrade to a recent release and use NumericField. -- Ian. On Mon, Mar 5, 2012 at 9:46 PM, Kushal Dave wrote: > I have an ID field that contains abo
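
A minimal sketch of why the padding matters, assuming ids are indexed as plain string terms and a range over them is compared lexicographically. The sorted set here is only a stand-in for the term dictionary, and `pad`, `terms`, and `rangeInclusive` are hypothetical helpers, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

class PaddedIds {
    // Left-pad an id with zeros to a fixed width; width <= 0 means no padding.
    static String pad(int id, int width) {
        if (width <= 0) {
            return Integer.toString(id);
        }
        return String.format(Locale.ROOT, "%0" + width + "d", id);
    }

    // Build the sorted string terms for ids 1..maxId.
    static TreeSet<String> terms(int maxId, int width) {
        TreeSet<String> result = new TreeSet<>();
        for (int id = 1; id <= maxId; id++) {
            result.add(pad(id, width));
        }
        return result;
    }

    // Inclusive range over string terms, compared lexicographically.
    static List<String> rangeInclusive(TreeSet<String> sorted, String lower, String upper) {
        return new ArrayList<>(sorted.subSet(lower, true, upper, true));
    }

    public static void main(String[] args) {
        // Unpadded: only "1", "10" and "100" sort between "1" and "100".
        System.out.println(rangeInclusive(terms(1000, 0), "1", "100").size()); // 3
        // Padded to six digits: string order now agrees with numeric order.
        System.out.println(rangeInclusive(terms(1000, 6), pad(1, 6), pad(100, 6)).size()); // 100
    }
}
```

The width of six is an arbitrary choice for the sketch; as Ian notes, it has to cover the largest id you expect, now or in the future.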

How disabling norms on a field effects other fields

2012-03-06 Thread Paul Taylor
I have a number of fields that either only ever have a term frequency of 1 or I don't want them to be disadvantaged if they do have a greater term frequency, and I never boost the field so I disable norms for these fields with Field.Index.ANALYZED_NO_NORMS or Field.Index.NOT_ANALYZED_NO_NORMS. B

Re: How disabling norms on a field effects other fields

2012-03-06 Thread Paul Taylor
On 06/03/2012 21:44, Paul Taylor wrote: I have a number of fields that either only ever have a term frequency of 1 or I don't want them to be disadvantaged if they do have a greater term frequency, and I never boost the field so I disable norms for these fields with Field.Index.ANALYZED_NO_NORM

Re: SweetSpotSimilarity

2012-03-06 Thread Paul Taylor
On 05/03/2012 23:24, Robert Muir wrote: On Mon, Mar 5, 2012 at 6:01 PM, Paul Hill wrote: I would definitely not suggest using SSS for fields like legal brief text or emails where there is huge variability in the length of the content -- i can't think of any context where a "short" email is de

Re: SweetSpotSimilarity

2012-03-06 Thread Robert Muir
On Tue, Mar 6, 2012 at 5:57 PM, Paul Taylor wrote: >> Hello, >> >> what is previously Similarity in older releases is moved to >> TFIDFSimilarity: it extends Similarity and exposes a vector-space API, >> with its same formulas in the javadocs: >> >> https://builds.apache.org/view/G-L/view/Lucene/j

Re: How disabling norms on a field effects other fields

2012-03-06 Thread Hany Azzam
i.e. Field length :) A trivial question maybe: if one uses these flags does that mean they don't need to override the computeNorm method as shown in Simon's article on searchworkings? I am referring to the case when one doesn't want to use norms. h. -Original Message- From: Paul Taylor

More About NOT Optimizing

2012-03-06 Thread Paul Hill
I'm running with 3.4 code and have studied up on all the API related to the optimize() replacements and understand I needn't worry about deleted documents, but I still want to ask a few things about keeping the index in good shape and about merge policy. I have an index with 421163 documents (in