Re: Detecting duplicates

2011-03-04 Thread Li Li
It's the problem of near-duplicate detection. There are many papers addressing this problem, and methods like simhash are used. 2011/3/5 Mark > Is there a way one could detect duplicates (say by using some unique hash > of certain fields) and marking a document as a duplicate but not remove it. >
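For readers unfamiliar with simhash, here is a minimal sketch of the idea in plain Java. It is not Lucene code and not necessarily what the poster had in mind: the class name `SimHashSketch`, the whitespace/punctuation tokenization, and the FNV-1a token hash are all assumptions chosen for illustration. Each token votes on every bit of a 64-bit fingerprint, and two texts are compared by the Hamming distance between their fingerprints.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Illustrative 64-bit simhash; tokenization and token hash are assumptions. */
final class SimHashSketch {

    // FNV-1a 64-bit hash of one token (any well-mixed 64-bit hash would do).
    static long hash64(String token) {
        long h = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Every token adds +1 or -1 to each bit position; the sign of the sum
    // decides the corresponding bit of the fingerprint.
    static long simhash(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            long h = hash64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= (1L << bit);
            }
        }
        return fingerprint;
    }

    // Number of differing bits; small distances suggest near-duplicates.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

Texts whose fingerprints differ in only a few bits can be treated as near-duplicates. The threshold has to be tuned for the data and is not given in the thread.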

Re: Recent Content - Lucene vs. DB SELECT / DB Triggers / Memcached

2011-03-04 Thread DBSight
Can you just maintain the site-wide top 5 by combining the top 5 of each shard? It can be done in memory with just O(1) operations. No Lucene is needed. -- -- Chris Lu - Instant Scalable Full-Text Search On Any Database/Application site: http://www.dbsight.net demo: http
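The in-memory merge suggested above could look roughly like the following sketch. The `Item` type, its `createdAt` field, and the `mergeTop` helper are hypothetical names, not part of any library. With a fixed number of shards S and a fixed k = 5, the merge touches at most S * k items, which is presumably what the "O(1)" refers to.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative merge of per-shard "most recent" lists into one site-wide list. */
final class RecentMerge {

    /** Hypothetical item: an id plus a creation timestamp (epoch millis). */
    static final class Item {
        final String id;
        final long createdAt;

        Item(String id, long createdAt) {
            this.id = id;
            this.createdAt = createdAt;
        }
    }

    // Concatenate the per-shard lists, sort newest first, keep the first k.
    static List<Item> mergeTop(List<List<Item>> perShardTop, int k) {
        List<Item> all = new ArrayList<>();
        for (List<Item> shardTop : perShardTop) {
            all.addAll(shardTop);
        }
        all.sort(Comparator.comparingLong((Item i) -> i.createdAt).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}
```

Each shard only has to keep its own top k up to date. The site-wide list can then be rebuilt whenever any shard's list changes.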

Detecting duplicates

2011-03-04 Thread Mark
Is there a way one could detect duplicates (say by using some unique hash of certain fields) and mark a document as a duplicate without removing it? Here is an example: Doc 1) This is my test Doc 2) This is my test Doc 3) Another test Doc 4) This is my test Doc 1 and 3 should be considered u
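One way to picture the "hash of certain fields" idea is the plain-Java sketch below. It is an assumption-laden illustration, not a Lucene or Solr feature: `DuplicateMarker`, `signature`, and `markDuplicates` are hypothetical names, MD5 is an arbitrary digest choice, and each document is reduced to a single text field. The first document with a given signature is left unmarked, and later documents with the same signature are flagged but kept.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative exact-duplicate marking via a digest of selected field values. */
final class DuplicateMarker {

    // Hex MD5 over the chosen field values, with a 0 byte between values
    // so that ("ab", "c") and ("a", "bc") produce different signatures.
    static String signature(String... fieldValues) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String value : fieldValues) {
                md.update(value.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // duplicate[i] is true when an earlier document had the same signature.
    static boolean[] markDuplicates(List<String> docs) {
        Set<String> seen = new HashSet<>();
        boolean[] duplicate = new boolean[docs.size()];
        for (int i = 0; i < docs.size(); i++) {
            duplicate[i] = !seen.add(signature(docs.get(i)));
        }
        return duplicate;
    }
}
```

In an index, the flag could be stored as an extra field so that queries can include or exclude marked documents. How to do that in Lucene is not covered by this sketch.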

Re: Multiple IndexWriter question

2011-03-04 Thread Ian Lea
If you've got multiple IndexWriters on one index open at the same time then you must be messing with Lucene's locking and all bets are off. From the javadocs for IndexWriter: Opening an IndexWriter creates a lock file for the directory in use. Trying to open another IndexWriter on the same direc

Re: Recent Content - Lucene vs. DB SELECT / DB Triggers / Memcached

2011-03-04 Thread Ian Lea
I'd go for lucene. The near realtime search stuff should minimise delays in making content visible and if you can do the last X since whenever with a NumericRangeQuery, perhaps in conjunction with a custom Collector, that will be fast too. -- Ian. On Fri, Mar 4, 2011 at 5:59 PM, BrightMinds De

Multiple IndexWriter question

2011-03-04 Thread Brian Coverstone
I am a Lucene newbie, so I apologize beforehand if I am asking anything silly, or that has been covered before. I am currently debugging a project using Lucene. The problem that is happening is searches stop responding when an IndexWriter is writing to the index. In going through the code, I am

Re: Lucene nightly build: similarity score per field

2011-03-04 Thread Patrick Diviacco
ok thanks, one last thing: in my TimeSimilarity class, I just need to use this formula: queryTimeValue - DocTimeValue / normalizationFactor to compute the similarity score of a time/date field. How do you suggest implementing this? Which methods do I need to override? thanks On 4 March 2011
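A note on the formula as typed: under ordinary operator precedence it reads as queryTimeValue - (DocTimeValue / normalizationFactor). The sketch below assumes the intended grouping is (queryTimeValue - DocTimeValue) / normalizationFactor, which is only a guess from context. It shows the arithmetic alone. `TimeScoreSketch` and its methods are hypothetical names, and the sketch does not show how to plug the value into a Lucene Similarity.

```java
/** Illustrative arithmetic for the time formula; not tied to any Lucene API. */
final class TimeScoreSketch {

    // Assumed intended grouping: (query - doc) / normalizationFactor.
    static double normalizedTimeDifference(long queryTime, long docTime,
                                           double normalizationFactor) {
        if (normalizationFactor <= 0) {
            throw new IllegalArgumentException("normalizationFactor must be positive");
        }
        return (queryTime - docTime) / normalizationFactor;
    }

    // Literal reading of the formula as written, shown for comparison only.
    static double literalReading(long queryTime, long docTime,
                                 double normalizationFactor) {
        return queryTime - docTime / normalizationFactor;
    }
}
```

The two readings give very different values for the same inputs, so it is worth pinning down the grouping before writing any scoring code.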

Re: Lucene nightly build: similarity score per field

2011-03-04 Thread Robert Muir
On Fri, Mar 4, 2011 at 2:12 PM, Patrick Diviacco wrote: > hey Robert, > > I know there is the documentation, I'm sorry I've confused setSimilarity > with setSimilarityProvider. > > However, my question was about "Similarity get(String field) method" (I > cannot understand from documentation sorry)

Re: Lucene nightly build: similarity score per field

2011-03-04 Thread Patrick Diviacco
hey Robert, I know there is the documentation, I'm sorry I've confused setSimilarity with setSimilarityProvider. However, my question was about "Similarity get(String field) method" (I cannot understand from documentation sorry). Should I create a customSimilarity class implementing the Similari

Re: Lucene nightly build: similarity score per field

2011-03-04 Thread Robert Muir
On Fri, Mar 4, 2011 at 1:18 PM, Patrick Diviacco wrote: > So far, I know I can customize the similarity class for the searcher: > searcher.setSimilarity(new BoostingSimilarity()); > This is not correct.. have you read the javadocs? IndexSearcher doesn't have a setSimilarity() anymore, it has set

Re: Lucene nightly build: similarity score per field

2011-03-04 Thread Patrick Diviacco
All right. So it is still not clear how to exactly implement it. I have SimilarityA and SimilarityB subclasses. So far, I know I can customize the similarity class for the searcher: searcher.setSimilarity(new BoostingSimilarity()); When/how should I use get method ? Similarity get(String field)

Recent Content - Lucene vs. DB SELECT / DB Triggers / Memcached

2011-03-04 Thread BrightMinds Dev
We are developing a large 4-tier multi-server app that will accept Questions and related Comments supplied by users. There will be 100K's of users that live in Shards. Also, ideally there would be no delay in adding content and seeing it in recent results but to make the system performant a d

Re: IndexReader.reopen() question

2011-03-04 Thread Lee
Thanks Ian, and Mike -- the code below was the result of badly copying the Javadocs in exasperation and panic: all points taken with gratitude. Cheers Lee On 04/03/2011 16:40, Ian Lea wrote: Looks basically OK to me. I wonder if you need the isCurrent() check as well as if (newReader != reade

Re: IndexReader.reopen() question

2011-03-04 Thread Michael McCandless
On Fri, Mar 4, 2011 at 8:20 AM, Lee Goddard wrote: > Does this look correct? I am told it is not functioning, in that new > entries to the index are not being picked-up? Also be careful w/ threads -- if queries are "in flight", closing the reader out from under them will cause problems. -- Mi

Re: IndexReader.reopen() question

2011-03-04 Thread Ian Lea
Looks basically OK to me. I wonder if you need the isCurrent() check as well as if (newReader != reader) but shouldn't do any harm. Likewise there doesn't seem much point in reassigning reader and creating a new searcher if newReader is the same as reader. But I don't think that either of those w

Re: index enforcing query terms to appear within the same sentence

2011-03-04 Thread Ian Lea
Another index, or a different field in the same index but without the modified gaps. Maybe PerFieldAnalyzerWrapper would help - one Analyzer for field x with modified gaps and a different one for field y with standard gaps. -- Ian. On Fri, Mar 4, 2011 at 2:40 PM, Michael Wiegand wrote: > Than

Re: WhitespaceAnalyzer in Lucene nightly build ?

2011-03-04 Thread Ian Lea
Try passing an org.apache.lucene.util.Version parameter. Looks like this version needs it. http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/core/WhitespaceAnalyzer.java?view=markup assuming that is the right link. Using trunk versions is

Is ConcurrentMergeScheduler useful for multiple running IndexWriter's?

2011-03-04 Thread Jason Rutherglen
ConcurrentMergeScheduler is tied to a specific IndexWriter, however if we're running in an environment (such as Solr's multiple cores, and other similar scenarios) then we'd have a CMS per IW. I think this effectively disables CMS's max thread merge throttling feature? ---

Re: WhitespaceAnalyzer in Lucene nightly build ?

2011-03-04 Thread Patrick Diviacco
All right, I've downloaded the 2 jars and I've imported the following line, since WhitespaceAnalyzer is in the core folder. import org.apache.lucene.analysis.core.*; However I get the following error: CollectionIndexer.java:80: cannot find symbol symbol : constructor WhitespaceAnalyzer() locati

RE: WhitespaceAnalyzer in Lucene nightly build ?

2011-03-04 Thread Uwe Schindler
As I said, the nightly maven jars are exactly the same as the nightly zip file. So simply use analyzers-common.jar and lucene-core.jar in your class path after you downloaded it from the URL I told you. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@t

Re: index enforcing query terms to appear within the same sentence

2011-03-04 Thread Michael Wiegand
Thank you for all these useful hints! If I use the multi-valued fields in combination with "modified" position increments, I would actually distort the shape of a document. For instance, if I would like to compare a retrieval enforcing query term co-occurrence within the same sentence with a co

RE: WhitespaceAnalyzer in Lucene nightly build ?

2011-03-04 Thread Steven A Rowe
Hi Patrick, The Jenkins (formerly Hudson) nightly Ant builds do not produce the jar containing WhitespaceAnalyzer. This is not intentional - I just created an issue to track fixing the problem: . The nightly Maven JARs Uwe pointed you to are

Lucene 4.0 and WhitespaceAnalyzer

2011-03-04 Thread Patrick Diviacco
What's the best way to replace WhitespaceAnalyzer in this line in Lucene nightly build 4.0 ? Is there a generic analyzer I can use ? writer = new IndexWriter(FSDirectory.open(INDEX_DIR), new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED); thanks

Re: WhitespaceAnalyzer in Lucene nightly build ?

2011-03-04 Thread Patrick Diviacco
sorry, so what you are saying is that I don't have working analyzers in the nightly build ? In other words, I cannot index with it ? Which version is the nightly Maven JARs ? I actually need to compute similarity per-field: the patch has been committed and it is currently working with Lucene 4.0 T

IndexReader.reopen() question

2011-03-04 Thread Lee Goddard
Hello list, Does this look correct? I am told it is not functioning, in that new entries to the index are not being picked-up? Thanks Lee try { if (! reader.isCurrent()){ IndexReader newReader = reader.reopen(); if (newReader != reader) {

Re: index enforcing query terms to appear within the same sentence

2011-03-04 Thread Ian Lea
You can use multi valued fields if you play with the position increment gap. See e.g. http://lucene.472066.n3.nabble.com/Problem-searching-in-the-same-sentence-td1501269.html A google search for "lucene indexing sentences" or similar finds that, and more. Different docs can have different field