Re: Lucene Query

2014-08-19 Thread Tri Cao
not matter. Uwe On 19 August 2014 at 22:05:23 MESZ, Tri Cao wrote: >OR operator does that, AND only returns docs with ALL terms present. > >Note that you have two options here: >1. Create a BooleanQuery object (see the Java doc I linked below) and

Re: Lucene Query

2014-08-19 Thread Tri Cao
Whoops, the constraint should be MUST to force all terms present: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanClause.Occur.html#MUST On Aug 19, 2014, at 01:05 PM, "Tri Cao" wrote: OR operator does that, AND only returns docs with ALL terms present. Not
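A minimal sketch of that corrected query, assuming the Lucene 4.6 API from the linked Javadoc and the "label" field from the thread's example (terms assumed lowercased at index time):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // Require every term to be present (boolean AND semantics).
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("label", "united")), Occur.MUST);
    query.add(new TermQuery(new Term("label", "states")), Occur.MUST);
    query.add(new TermQuery(new Term("label", "america")), Occur.MUST);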

Re: Lucene Query

2014-08-19 Thread Tri Cao
Aug 19, 2014, at 12:17 PM, Jin Guang Zheng wrote: Thanks for the reply, but won't BooleanQuery return both doc1 and doc2 with query: label:States AND label:America AND label:United Best, Jin On Tue, Aug 19, 2014 at 2:07 PM, Tri Cao wrote: > given that example, the easy way is a bo

Re: Lucene Query

2014-08-19 Thread Tri Cao
given that example, the easy way is a boolean AND query of all the terms: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanQuery.html However, if your corpus is more sophisticated, you'll find that relevance ranking is not always that trivial :) On Aug 19, 2014, at 11:00

Re: Calculate Term Frequency

2014-08-19 Thread Tri Cao
Erick, Solr's termfreq implementation also uses DocsEnum, with the assumption that freqs are requested on ascending doc IDs, which is valid when scoring from the hit list. If freq is requested for an out-of-order doc, a new DocsEnum has to be created. Bianca, can you explain your use case in more
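A sketch of the forward-only iteration that makes this assumption valid, using the Lucene 4.x AtomicReader API; the field and term names are illustrative:

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.DocsEnum;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.DocIdSetIterator;

    // DocsEnum only moves forward through ascending doc IDs; asking for the
    // freq of an earlier doc requires creating a fresh enum.
    DocsEnum de = atomicReader.termDocsEnum(new Term("body", "lucene"));
    if (de != null) {
      int doc;
      while ((doc = de.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        int tf = de.freq(); // term frequency of "lucene" in this doc
      }
    }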

Re: deleteDocument with NRT

2014-07-14 Thread Tri Cao
On Jul 14, 2014, at 03:09 AM, Ganesh wrote: How does Solr handle this scenario? Does it reopen the reader after every delete, or does it maintain a list of deleted documents in a cache? Regards Ganesh On 7/11/2014 4:00 AM, Tri Cao wrote: > You need to reopen your searcher after

Finding words not followed by other words

2014-07-11 Thread Tri Cao
This is actually a tough problem in general: word-sense disambiguation (polysemy). In your case, I think you'll probably need to do some named-entity resolution to differentiate "George Washington" from "George Washington Carver", as they are two different entities. Do you have a list o

Re: deleteDocument with NRT

2014-07-10 Thread Tri Cao
You need to reopen your searcher after deleting. From Java doc for SearcherManager: In addition you should periodically call maybeRefresh. While it's possible to call this just before running each query, this is discouraged since it penalizes the unlucky queries that do the reopen. It's better
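A sketch of the recommended pattern, assuming a SearcherManager built over an IndexWriter; the refresh interval is an arbitrary choice:

    import java.io.IOException;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.SearcherManager;

    SearcherManager manager = new SearcherManager(writer, true, null);

    // Refresh periodically from a background thread instead of per query.
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleWithFixedDelay(new Runnable() {
      public void run() {
        try {
          manager.maybeRefresh();
        } catch (IOException e) {
          // log and carry on; the previous searcher stays valid
        }
      }
    }, 1, 1, TimeUnit.SECONDS);

    // Per query: acquire/release so a reopen can't close a searcher in flight.
    IndexSearcher searcher = manager.acquire();
    try {
      // searcher.search(...)
    } finally {
      manager.release(searcher);
    }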

Re: How to handle words that stem to stop words

2014-07-07 Thread Tri Cao
I think emitting two tokens for "vans" is the right (and potentially only) way to do it. You could also control the dictionary of terms that require this special treatment. Is there any reason you're not happy with this approach? On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden wrote: Hello list,
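For reference, stock Lucene can emit both the original and the stemmed token with KeywordRepeatFilter. This is a sketch of that approach (my assumption, not necessarily the chain discussed in the thread), under which "vans" survives alongside the stemmed "van":

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
    import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String field, Reader reader) {
        Tokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
        TokenStream stream = new LowerCaseFilter(Version.LUCENE_46, source);
        stream = new KeywordRepeatFilter(stream);         // emits each token twice, one flagged as keyword
        stream = new PorterStemFilter(stream);            // stems only the unflagged copy
        stream = new RemoveDuplicatesTokenFilter(stream); // drops copies the stemmer didn't change
        return new TokenStreamComponents(source, stream);
      }
    };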

Re: Can Lucene based application be made to work with Scaled Elastic Beanstalk environemnt on Amazon Web Services

2014-06-27 Thread Tri Cao
I would just use S3 as a data push mechanism. In your servlet's init(), you could download the index from S3 and unpack it to a local directory, then initialize your Lucene searcher on that directory. Downloading from S3 to EC2 instances is free, and 5GB would take a minute or two. Also, if you p
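A rough sketch of that init() pattern, assuming the AWS SDK for Java (v1); the bucket, key, and unzip helper are hypothetical:

    import java.io.*;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3Object;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class SearchServlet extends HttpServlet {
      private IndexSearcher searcher;

      @Override
      public void init() throws ServletException {
        try {
          AmazonS3Client s3 = new AmazonS3Client(); // uses the EC2 instance role on AWS
          S3Object obj = s3.getObject("my-index-bucket", "index.zip"); // hypothetical names
          File archive = new File(System.getProperty("java.io.tmpdir"), "index.zip");
          try (InputStream in = obj.getObjectContent();
               OutputStream out = new FileOutputStream(archive)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
          }
          File indexDir = unzip(archive); // hypothetical helper that unpacks the index
          searcher = new IndexSearcher(DirectoryReader.open(FSDirectory.open(indexDir)));
        } catch (IOException e) {
          throw new ServletException(e);
        }
      }
    }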

Re: search performance

2014-06-02 Thread Tri Cao
This is an interesting performance problem, and I don't think there is a single answer here, so I'll just lay out the steps I would take to tackle it: 1. What is the variance of the query latency? You said the average is 5 minutes, but is it due to some really bad queries, or do most queries h

Re: maxDoc/numDocs int fields

2014-03-21 Thread Tri Cao
I ran into this issue before and after some digging, I don't think there is an easy way to accommodate long IDs in Lucene. So I decided to go with sharding documents into multiple indexes. It turned out to be a good decision in my case because I would have to shard the index anyway for performance
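As one illustration of the routing this implies (my assumption, not the exact scheme used here), a long external ID can be mapped to a shard deterministically:

    // Route a 64-bit external ID to one of N index shards; the shard count
    // and naming are hypothetical.
    int numShards = 4;
    long externalId = 123456789012345L;
    int shard = (int) ((externalId % numShards + numShards) % numShards); // safe for negative IDs
    // index into / search against "index-shard-" + shard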

Re: How to search for terms containing negation

2014-03-17 Thread Tri Cao
analyzer/parser would you recommend? Thank you again, Natalia On Mon, Mar 17, 2014 at 3:35 PM, Tri Cao <tm...@me.com> wrote: Natalia, First make sure that your analyzers (both index and query analyzers) do not filter out these as stop words. I think the standard StopFilter list has "no"

Re: How to search for terms containing negation

2014-03-17 Thread Tri Cao
Natalia, First make sure that your analyzers (both index and query analyzers) do not filter out these as stop words. I think the standard StopFilter list has "no" and "not". You can try to see if your index has these terms by querying for "no" as a TermQuery. If there is no match for that query, th
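The probe described above is a one-liner; a sketch, with a hypothetical "text" field:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // If this returns zero hits, "no" was probably stripped at index time,
    // e.g. by a StopFilter.
    TopDocs hits = searcher.search(new TermQuery(new Term("text", "no")), 1);
    boolean indexed = hits.totalHits > 0;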

Re: IndexWriter croaks on large file

2014-02-19 Thread Tri Cao
ing if there's anything I should be aware of. Regards, John On 2/14/14 4:37 PM, Tri Cao wrote: As docIDs are ints too, it's most likely he'll hit the limit of 2B documents per index with that approach though :) I do agree that indexing huge documents doesn't seem to have

Re: IndexWriter croaks on large file

2014-02-14 Thread Tri Cao
As docIDs are ints too, it's most likely he'll hit the limit of 2B documents per index with that approach though :) I do agree that indexing huge documents doesn't seem to have a lot of value; even when you know a doc is a hit for a certain query, how are you going to display the results to use

Re: Collector is collecting more than the specified hits

2014-02-14 Thread Tri Cao
If I understand correctly, you'd like to shortcut the execution when you reach the desired number of hits. Unfortunately, I don't think there's a graceful way to do that right now in Collector. To stop further collecting, you need to throw an IOException (or a subtype of it) and catch the exception la
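A sketch of that workaround against the Lucene 4.x Collector API; the marker exception and hit limit are illustrative:

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Hypothetical marker exception used to abort collection early.
    class EarlyTerminationException extends IOException {}

    class FirstNCollector extends Collector {
      private final int limit;
      private int count;

      FirstNCollector(int limit) { this.limit = limit; }

      @Override public void setScorer(Scorer scorer) {}
      @Override public void setNextReader(AtomicReaderContext context) {}
      @Override public boolean acceptsDocsOutOfOrder() { return true; }

      @Override public void collect(int doc) throws IOException {
        // record the hit here, then bail out once we have enough
        if (++count >= limit) throw new EarlyTerminationException();
      }
    }

    // The caller treats the exception as the normal "done" signal:
    try {
      searcher.search(query, new FirstNCollector(100));
    } catch (EarlyTerminationException e) {
      // collected enough hits; fall through
    }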

Re: incrementally indexing

2012-07-05 Thread Tri Cao
If you want to index your hard drive, you'll need to keep a copy of the current file system's directory/file structure. Otherwise, you won't be able to remove from your index the files that have been deleted. On Jul 5, 2012, at 12:18 PM, Erick Erickson wrote: > Hmmm, it's not quite clear what the p
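A sketch of the snapshot diff this implies; loadSnapshot and deleteFromIndex are hypothetical helpers, and recursion into subdirectories is elided:

    import java.io.File;
    import java.util.HashSet;
    import java.util.Set;

    Set<String> indexed = loadSnapshot();         // paths recorded on the last run (hypothetical)
    Set<String> current = new HashSet<String>();
    File[] files = new File("/data").listFiles(); // hypothetical root directory
    if (files != null) {
      for (File f : files) current.add(f.getAbsolutePath());
    }
    for (String path : indexed) {
      if (!current.contains(path)) {
        deleteFromIndex(path); // e.g. writer.deleteDocuments(new Term("path", path))
      }
    }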

Re: custom scoring

2012-04-08 Thread Tri Cao
through the TopDocs and apply the constraints I need to. I think this will work, but I have some concerns about performance. What would you think? Thanks, Tri. On Apr 06, 2012, at 10:06 AM, Tri Cao wrote: Hi all, What would be the best approach for a custom scoring that requires a "global" view of
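A sketch of that post-pass over TopDocs, using the caps from the example below; the over-fetch factor is an assumption:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    // Over-fetch, then walk the ranked list enforcing per-color caps
    // (at most 3 red, 4 blue) within the first 16 results.
    TopDocs top = searcher.search(query, 16 * 4); // 4x over-fetch is an arbitrary choice
    Map<String, Integer> caps = new HashMap<String, Integer>();
    caps.put("red", 3);
    caps.put("blue", 4);

    Map<String, Integer> seen = new HashMap<String, Integer>();
    List<ScoreDoc> page = new ArrayList<ScoreDoc>();
    for (ScoreDoc sd : top.scoreDocs) {
      if (page.size() == 16) break;
      String color = searcher.doc(sd.doc).get("color"); // "color" must be a stored field
      Integer used = seen.get(color);
      int u = used == null ? 0 : used;
      Integer cap = caps.get(color);
      if (cap != null && u >= cap) continue; // cap reached, skip this doc
      seen.put(color, u + 1);
      page.add(sd);
    }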

custom scoring

2012-04-06 Thread Tri Cao
Hi all, What would be the best approach for a custom scoring that requires a "global" view of the result set? For example, I have a field called "color", and I would like to have constraints such that there are at most 3 docs with color:red and 4 docs with color:blue in the first 16 hits. And the items should