Using TermVectorMapper to compute term frequency across documents

2009-10-12 Thread Thomas D'Silva
Hi, I am trying to compute the counts of terms of the documents returned by running a query using a TermVectorMapper. I was wondering if anyone knew if there was a faster way to do this rather than using a HashMap with a TermVectorMapper to store the counts of the terms and calling getTermF
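
The HashMap approach described in this message can be sketched in plain Java as follows. This is a minimal illustration, not Lucene code: the class name TermCountAggregator and the parallel terms/frequencies arrays are assumptions made for the example, and the one-element int[] counter is only one possible way to avoid allocating a new Integer on every update.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: sums term frequencies over many documents.
class TermCountAggregator {
    // A one-element int[] serves as a mutable counter, so an existing term
    // is incremented in place instead of being re-boxed on every update.
    private final Map<String, int[]> counts = new HashMap<String, int[]>();

    // Adds one document's term vector: terms[i] occurs freqs[i] times.
    void addVector(String[] terms, int[] freqs) {
        for (int i = 0; i < terms.length; i++) {
            int[] slot = counts.get(terms[i]);
            if (slot == null) {
                slot = new int[1];
                counts.put(terms[i], slot);
            }
            slot[0] += freqs[i];
        }
    }

    // Total frequency of a term over all vectors added so far; 0 if unseen.
    int count(String term) {
        int[] slot = counts.get(term);
        return slot == null ? 0 : slot[0];
    }
}
```

Calling addVector once per matching document and then reading count(term) gives the aggregate frequency; whether this is actually faster than other approaches would need to be measured on a real index.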

RE: querying multi-value fields

2009-10-12 Thread Angel, Eric
Erick, Thank you. This is awesome. I got it to work by just setting slop to 1 and returning 10 in my analyzer.getPositionIncrementGap. Here are my tests in case anyone else is interested: public class TestPositionIncrementGap extends TestCase { Analyzer analyzer = new Keyword
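
Why a position increment gap keeps phrase matches from spanning two values of the same field can be illustrated with a small model in plain Java. This is a simplified sketch under stated assumptions, not Lucene's implementation: the class PositionGapModel and its methods are hypothetical, tokens are assumed to be pre-split, and the exact position arithmetic inside Lucene may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of token positions in a multi-valued field.
class PositionGapModel {
    // Positions of all tokens, in order; the gap is added before each value after the first.
    static List<Integer> positions(String[][] values, int gap) {
        List<Integer> out = new ArrayList<Integer>();
        int next = 0;
        for (int v = 0; v < values.length; v++) {
            if (v > 0) {
                next += gap;
            }
            for (int t = 0; t < values[v].length; t++) {
                out.add(next++);
            }
        }
        return out;
    }

    // True if some occurrence of 'second' sits exactly one position after 'first',
    // which is what an exact two-term phrase match requires in this model.
    static boolean adjacent(String[][] values, int gap, String first, String second) {
        List<String> tokens = new ArrayList<String>();
        for (String[] value : values) {
            for (String token : value) {
                tokens.add(token);
            }
        }
        List<Integer> pos = positions(values, gap);
        for (int i = 0; i < tokens.size(); i++) {
            for (int j = 0; j < tokens.size(); j++) {
                if (tokens.get(i).equals(first) && tokens.get(j).equals(second)
                        && pos.get(j) - pos.get(i) == 1) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

In this model, with a gap of 0 the phrase "aaa value2" matches across the boundary between "value1 aaa" and "value2 bbb"; with a gap of 10 it no longer does, while "value1 aaa" still matches inside a single value.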

Re: Realtime search best practices

2009-10-12 Thread Yonik Seeley
Good point on isCurrent - I think it should only be with respect to the latest index commit point? and we should clarify that in the javadoc. [...] > // but what does the nrtReader say? > // it does not have access to the most recent commit > // state, as there's been a commit (with documents) > /

Re: Realtime search best practices

2009-10-12 Thread melix
Ok, thanks for the details. I see I'm not the only one finding the javadoc hard to understand. While this is well documented, it's still not clear enough about the exact semantics of "changes": at first I thought it returned an IndexReader on the *uncommitted changes only*, which meant it did not

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
I still see some things we might want to document or explain: We still need to be careful what the call to "isCurrent()" will mean in the future for IndexReaders - as now there is another kind of "current" - "current even up to uncommitted changes". Imagine the following set of IndexReaders float

Re: faceted search performance

2009-10-12 Thread Christoph Boosz
Hi Paul, Thanks for your suggestion. I will test it within the next few days. However, due to memory limitations, it will only work if the number of hits is small enough, am I right? Chris 2009/10/12 Paul Elschot > Chris, > > You could also store term vectors for all docs at indexing > time, a

Re: Realtime search best practices

2009-10-12 Thread John Wang
I think it was my email Yonik responded to and he is right, I was being lazy and didn't read the javadoc very carefully. My bad. Thanks for the javadoc change. -John On Mon, Oct 12, 2009 at 1:57 PM, Yonik Seeley wrote: > On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix > wrote: > > It may be surpri

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
On Mon, Oct 12, 2009 at 1:57 PM, Yonik Seeley wrote: > On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix > wrote: > > It may be surprising, but in fact I have read that > > javadoc. > > It was not your email I responded to. > Sorry, my bad then - you said "guys" and John and I were the last two to b

Re: Realtime search best practices

2009-10-12 Thread Michael McCandless
OK I just committed it -- thanks! Mike On Mon, Oct 12, 2009 at 5:01 PM, Jake Mannix wrote: > That seems a lot more straightforward Mike, thanks. > >  -jake > > On Mon, Oct 12, 2009 at 1:56 PM, Michael McCandless < > luc...@mikemccandless.com> wrote: > >> I agree, the javadocs could be improved.

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
That seems a lot more straightforward Mike, thanks. -jake On Mon, Oct 12, 2009 at 1:56 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > I agree, the javadocs could be improved. How about something like > this for the first 2 paragraphs: > > * Returns a readonly reader, covering

Re: Realtime search best practices

2009-10-12 Thread Yonik Seeley
On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix wrote: >  It may be surprising, but in fact I have read that > javadoc. It was not your email I responded to. >  It talks about not needing to close the > writer, but doesn't specifically talk about the what > the relationship between commit() calls a

Re: Realtime search best practices

2009-10-12 Thread Michael McCandless
I agree, the javadocs could be improved. How about something like this for the first 2 paragraphs: * Returns a readonly reader, covering all committed as * well as un-committed changes to the index. This * provides "near real-time" searching, in that changes * made during an IndexWri

Re: faceted search performance

2009-10-12 Thread Paul Elschot
Chris, You could also store term vectors for all docs at indexing time, and add the termvectors for the matching docs into a (large) map of terms in RAM. Regards, Paul Elschot On Monday 12 October 2009 21:30:48 Christoph Boosz wrote: > Hi Jake, > > Thanks for your helpful explanation. > In fac

Re: Realtime search best practices

2009-10-12 Thread Jason Rutherglen
Hi Cedric, There is a wiki page on NRT at: http://wiki.apache.org/lucene-java/NearRealtimeSearch Feel free to ask questions if there's not enough information. -J On Mon, Oct 12, 2009 at 2:24 AM, melix wrote: > > Hi, > > I'm going to replace an old reader/writer synchronization mechanism we had

Re: querying multi-value fields

2009-10-12 Thread Erick Erickson
<<>> Not quite. Starting with the second add, a call will be made to getPositionIncrementGap in your analyzer. If you return a number larger than one, then the offsets between the last term of the preceeding add and the first term of this add will be that number. If you do nothing with getPositionI

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
Thanks Yonik, It may be surprising, but in fact I have read that javadoc. It talks about not needing to close the writer, but doesn't specifically talk about what the relationship between commit() calls and getReader() calls is. I suppose I should have interpreted: "@returns a new reader

RE: querying multi-value fields

2009-10-12 Thread Angel, Eric
I need to analyze these values since I also want the benefits of porterStemmer. The problem with using PhraseQuery is that I don't always know the slop. I may have values like "value4 ddd aaa". It's a tricky problem because I think Lucene sees all these values as one long value for the field "optio

Re: Realtime search best practices

2009-10-12 Thread Yonik Seeley
Guys, please - you're not new at this... this is what JavaDoc is for: /** * Returns a readonly reader containing all * current updates. Flush is called automatically. This * provides "near real-time" searching, in that changes * made during an IndexWriter session can be made * a

Re: querying multi-value fields

2009-10-12 Thread Jake Mannix
Or else just make sure that you use PhraseQuery to hit this field when you want "value1 aaa". If you don't tokenize these pairs, then you will have to do prefix/wildcard matching to hit just "value1" by itself (if this is allowed by your business logic). -jake On Mon, Oct 12, 2009 at 1:21 PM,

Re: querying multi-value fields

2009-10-12 Thread Adriano Crestani
Hi Eric, To achieve what you want, do not tokenize the values you query/add to this field. On Mon, Oct 12, 2009 at 4:05 PM, Angel, Eric wrote: > I have documents that store multiple values in some fields (using the > document.add(new Field()) with the same field name). Here's what a > typical

Re: Realtime search best practices

2009-10-12 Thread John Wang
Oh, that is really good to know! Is this deterministic? e.g. as long as writer.addDocument() is called, next getReader reflects the change? Does it work with deletes? e.g. writer.deleteDocuments()? Thanks Mike for clarifying! -John On Mon, Oct 12, 2009 at 12:11 PM, Michael McCandless < luc...@mik

querying multi-value fields

2009-10-12 Thread Angel, Eric
I have documents that store multiple values in some fields (using the document.add(new Field()) with the same field name). Here's what a typical document looks like: doc.option="value1 aaa" doc.option="value2 bbb" doc.option="value3 ccc" I want my queries to only match individual values,

Re: new sorting api and some perf numbers

2009-10-12 Thread Bradford Stephens
Wow! This is awesome. Can't wait to see how it plays with Bobo :) On Sun, Oct 11, 2009 at 10:19 PM, John Wang wrote: > Hi guys: >    The new FieldComparator api looks really scary :) > >    But after some perf testing with numbers I'd like to share, I guess it > is worth it: > > HW: Mac Pro with

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
On Mon, Oct 12, 2009 at 12:26 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > On Mon, Oct 12, 2009 at 3:17 PM, Jake Mannix > wrote: > > > Wait, so according to the javadocs, the IndexReader which you got from > > the IndexWriter forwards calls to reopen() back to > IndexWriter.getRea

Re: faceted search performance

2009-10-12 Thread Christoph Boosz
Hi Jake, Thanks for your helpful explanation. In fact, my initial solution was to traverse each document in the result once and count the contained terms. As you mentioned, this process took a lot of memory. Trying to confine the memory usage with the facet approach, I was surprised by the decline

Re: Realtime search best practices

2009-10-12 Thread Michael McCandless
On Mon, Oct 12, 2009 at 3:17 PM, Jake Mannix wrote: > Wait, so according to the javadocs, the IndexReader which you got from > the IndexWriter forwards calls to reopen() back to IndexWriter.getReader(), > which means that if the user has a NRT reader, and the user keeps calling > reopen() on it,

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
Wait, so according to the javadocs, the IndexReader which you got from the IndexWriter forwards calls to reopen() back to IndexWriter.getReader(), which means that if the user has a NRT reader, and the user keeps calling reopen() on it, they're getting uncommitted changes as well, while if they cal

Re: Realtime search best practices

2009-10-12 Thread Michael McCandless
Just to clarify: IndexWriter.getReader returns a reader that searches uncommitted changes as well. I.e., you need not call IndexWriter.commit to make the changes visible. However, if you're opening a reader the "normal" way (IndexReader.open) then it is necessary to first call IndexWriter.commit.

RE: How do you properly use NumericField

2009-10-12 Thread Uwe Schindler
The source code attachment got somehow lost: import org.apache.lucene.analysis.WhitespaceAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.document.NumericField; import org.apache.lucene.index.IndexWriter; import org.apa

RE: How do you properly use NumericField

2009-10-12 Thread Uwe Schindler
Hello Paul, I implemented what you wanted in the attached test case. It works without problems. Your error was that, in the TermQuery creation, you passed a precisionStep as the shift value parameter, which is incorrect. By the way: Lucene 2.9.1 and Lucene 3.0 will be optimized for ranges like [1 TO 1],

Re: faceted search performance

2009-10-12 Thread Jake Mannix
Hey Chris, On Mon, Oct 12, 2009 at 10:30 AM, Christoph Boosz < christoph.bo...@googlemail.com> wrote: > Thanks for your reply. > Yes, it's likely that many terms occur in few documents. > > If I understand you right, I should do the following: > -Write a HitCollector that simply increments a coun

Re: faceted search performance

2009-10-12 Thread Christoph Boosz
Thanks for your reply. Yes, it's likely that many terms occur in few documents. If I understand you right, I should do the following: -Write a HitCollector that simply increments a counter -Get the filter for the user query once: new CachingWrapperFilter(new QueryWrapperFilter(userQuery)); -Create
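
The counting step in the approach summarized here amounts to intersecting each term's document set with the set of documents matching the user query. A minimal sketch of that idea using java.util.BitSet rather than Lucene's filter and collector classes; the class name FacetCount is an assumption for illustration only.

```java
import java.util.BitSet;

// Hypothetical helper: a facet count is the size of the intersection of two doc sets.
class FacetCount {
    // queryDocs: bits set for documents matching the user query (computed once).
    // termDocs:  bits set for documents containing the facet term.
    static int count(BitSet queryDocs, BitSet termDocs) {
        // Clone so the caller's term set is not modified by the in-place AND.
        BitSet both = (BitSet) termDocs.clone();
        both.and(queryDocs);
        return both.cardinality();
    }
}
```

Computing queryDocs once and reusing it for every term mirrors the caching of the query filter mentioned in the message; the cost per term is then one AND over the bit set, which is wasteful when the term set is very sparse.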

Re: Realtime search best practices

2009-10-12 Thread Jake Mannix
Hi Cedric, I don't know of anyone with a substantial throughput production system who is doing realtime search with the 2.9 improvements yet (and in fact, no serious performance analysis has been done on these even "in the lab" so to speak: follow https://issues.apache.org/jira/browse/LUCENE-157

Re: Getting left and right offsets of term search results

2009-10-12 Thread Till Kolter
Thanks a lot. I think TermPositionVector will solve my problem, although it seems to be a little slow. Concerning the term representation: our data is way more complex than just phrasal annotation; it was just an example, because I am not allowed to talk about our internal organisation. I

Re: faceted search performance

2009-10-12 Thread John Wang
Given you have 1M docs and about 1M terms, do you see very few docs per term? If your DocSet per term is very sparse, BitSet is probably not a good representation. A simple int array may be better for memory, and faster for iterating. -John On Mon, Oct 12, 2009 at 8:45 AM, Paul Elschot wrote: > On
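
The memory trade-off behind this suggestion can be estimated with a rough back-of-the-envelope sketch. The class SparseDocSet and its methods are hypothetical, and the byte counts ignore object headers and array overhead, so they are approximations only.

```java
import java.util.BitSet;

// Hypothetical helper comparing a bit set with an int array of doc ids.
class SparseDocSet {
    // A bit set needs one bit per document in the index, whatever the term's frequency.
    static long bitSetBytes(int maxDoc) {
        return ((maxDoc + 63L) / 64L) * 8L;
    }

    // An int array needs four bytes per document that actually contains the term.
    static long intArrayBytes(int numDocs) {
        return 4L * numDocs;
    }

    // The int array wins when the term occurs in fewer than roughly maxDoc / 32 documents.
    static boolean preferIntArray(int maxDoc, int numDocs) {
        return intArrayBytes(numDocs) < bitSetBytes(maxDoc);
    }

    // Counts how many of the term's doc ids are set in the query result,
    // touching only the term's own documents instead of scanning every bit.
    static int countMatches(int[] docIds, BitSet queryDocs) {
        int n = 0;
        for (int id : docIds) {
            if (queryDocs.get(id)) {
                n++;
            }
        }
        return n;
    }
}
```

For an index of 1,000,000 documents this estimate gives about 125,000 bytes per bit set, against 400 bytes for a term that occurs in 100 documents.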

Re: How do you properly use NumericField

2009-10-12 Thread Paul Taylor
Uwe Schindler wrote: Can you print the upper and lower term or the term you received in newRangeQuery and newTermQuery also to System.out? Maybe it is converted somehow by your Analyzer, that is used for parsing the query. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.theta

RE: How do you properly use NumericField

2009-10-12 Thread Uwe Schindler
Can you print the upper and lower term or the term you received in newRangeQuery and newTermQuery also to System.out? Maybe it is converted somehow by your Analyzer, that is used for parsing the query. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thet

Re: faceted search performance

2009-10-12 Thread Paul Elschot
On Monday 12 October 2009 14:53:45 Christoph Boosz wrote: > Hi, > > I have a question related to faceted search. My index contains more than 1 > million documents, and nearly 1 million terms. My aim is to get a DocIdSet > for each term occurring in the result of a query. I use the approach > descr

faceted search performance

2009-10-12 Thread Christoph Boosz
Hi, I have a question related to faceted search. My index contains more than 1 million documents, and nearly 1 million terms. My aim is to get a DocIdSet for each term occurring in the result of a query. I use the approach described on http://sujitpal.blogspot.com/2007/04/lucene-search-within-sear

Re: Usage of Lucene/Hibernate Search for Contacts Merging operation

2009-10-12 Thread Rene Wiermer
nitingupta183 wrote: > Hi all, > > I am supposed to add a feature in which my app will detect the duplicate > contacts of a user on the basis of their name, email, mobile number > etc. (i.e. Contacts Duplicate Killer kind of feature). The simplest al

Usage of Lucene/Hibernate Search for Contacts Merging operation

2009-10-12 Thread nitingupta183
Hi all, I am supposed to add a feature in which my app will detect the duplicate contacts of a user on the basis of their name, email, mobile number etc. (i.e. Contacts Duplicate Killer kind of feature). The simplest algo I can think of is to find all the contacts on the basis of their name, email an

Re: Lucene

2009-10-12 Thread Ian Lea
You are storing this field without analysis, correctly as you want exact matches only, but using StandardAnalyzer at query time. Use PerFieldAnalyzerWrapper, specifying KeywordAnalyzer for this field. Using MultiFieldQueryParser may not make much sense here. -- Ian. On Mon, Oct 12, 2009 at 11

Re: How do you properly use NumericField

2009-10-12 Thread Paul Taylor
Uwe Schindler wrote: I forgot: The format of numeric fields is also not plain text, because of this a simple TermQuery as generated by your query parser will not work, too. If you want to hit numeric values without a NumericRangeQuery with lower and upper bound equal, you have to use NumericUtil

Lucene

2009-10-12 Thread nja
Hi, I am using StandardAnalyzer for indexing as well as searching the indexes. But my search doesn't work correctly with special characters. I am storing some special characters in a field called TransType, i.e. document.add(new Field("TransType", "db92fb60-b716-11de-8718-001a4bc7d46e", Field

Realtime search best practices

2009-10-12 Thread melix
Hi, I'm going to replace an old reader/writer synchronization mechanism we had implemented with the new near realtime search facilities in Lucene 2.9. However, it's still a bit unclear how to do it efficiently. Is the following implementation the right way to achieve it? The context is con