Re: When does QueryParser creates PhraseQueries

2008-02-26 Thread duiduder
Daniel, thank you very much for the hint! I stepped through the code and tried some scenarios. When I type in with whitespace delimiters ~ termA termB this results in two invocations of getFieldQuery, one for each term. When I type ~

Re: Transactions in Lucene

2008-02-26 Thread Michael McCandless
Super! Thanks for testing this and posting... Mike [EMAIL PROTECTED] wrote: I don't think creating an IndexWriter is very expensive at all. Ah ok. I tested it. Creating an IndexWriter on an index with 10.000 docs (about 15 MB) takes about 200 ms. This is a very cheap operation for me ;)

Re: When does QueryParser creates PhraseQueries

2008-02-26 Thread duiduder
So, I stepped through the QueryParser code further, and I have now found the source of this behaviour: the QueryParserTokenManager ~System.out.println("This one returns the whole String:"); ~String strQuery = "home/reuschling"; ~

Re: Rebuilding Document from index?

2008-02-26 Thread Erick Erickson
See TermDocs/TermEnum. Or perhaps TermFreqVector. I admit I haven't used that last, but that family of methods ought to fix you up. What problem are you trying to solve? Perhaps there are better solutions to suggest. Best, Erick On Mon, Feb 25, 2008 at 6:04 PM, Itamar Syn-Hershko [EMAIL

RE: Rebuilding Document from index?

2008-02-26 Thread Itamar Syn-Hershko
Implementing something like MoreLikeThis for Hebrew. Non-Hebrew implementations are relevant, but much less accurate, since a word like PURIM can show up in the actual document with initials (LPURIM, BPURIM etc.) or even with 1-4 letters after it, which all refer to the same term, and then the

How to find most popular terms quickly?

2008-02-26 Thread Zhang, Lisheng
Hi, I have a very large number of documents indexed. One field is Brand (untokenized), and I need to find the most popular brand (the brand used by the most docs). One way is: 1) open IndexReader. 2) call terms() to get all terms, then filter out terms in field Brand. 3) call termDocs(Term) to

Lucene Search Performance

2008-02-26 Thread Jamie
Hi I am looking for a way to improve the search performance of my application. I've followed every suggestion in the Lucene Wiki but the search is still too slow with large indexes. I was wondering whether there was a way to restrict a search to a specific time period and in doing so

Question: Using Shingle Analyzer NGramAnalyzerWrapper in Lucene

2008-02-26 Thread Stanley Xinlei Wang
Hi, In Lucene, I'm trying to perform word-level bi-gram query parsing using NGramAnalyzerWrapper. I couldn't get any word pairs in the parsed query, and I was wondering what I should do to make this work. I'm using Lucene 2.2.0. I'm using the files from:

Vector Space Model: New Similarity Implementation Issues

2008-02-26 Thread Dharmalingam
Hi List, I am pretty new to Lucene. Certainly, it is very exciting. I need to implement a new Similarity class based on the Term Vector Space Model given in http://www.miislita.com/term-vector/term-vector-3.html Although that model is similar to Lucene’s model

Re: Question: Using Shingle Analyzer NGramAnalyzerWrapper in Lucene

2008-02-26 Thread Stanley Xinlei Wang
Sorry, slight correction for the code below: I was actually using the WhitespaceAnalyzer, not the StandardAnalyzer, in constructing the NGramAnalyzerWrapper. On Tue, 26 Feb 2008, Stanley Xinlei Wang wrote: Hi, In Lucene, I'm trying to perform word-level bi-gram query parsing using

Inconsistent Search Speed

2008-02-26 Thread fangz
Hi, I am using a simple Java program to test the search speed. The index file is about 1.93G in size. I created an IndexSearcher and built a query using the query parser: parser.parse("entity:fail"). The initial run took more than 60 seconds, but the subsequent runs only took 1.5 seconds. This

RE: Question: Using Shingle Analyzer NGramAnalyzerWrapper in Lucene

2008-02-26 Thread Steven A Rowe
Hi Stanley, I modernized the files in LUCENE-400 a bit - you can see the details in comments I made on the issue. The results, including all files needed to address the issue, are in the file attached to the issue named LUCENE-400.patch. I can tell you aren't using the modernized version

Re: Rebuilding Document from index?

2008-02-26 Thread Mathieu Lecarme
Yes, I've found a tester! A patch was submitted for this kind of job: https://issues.apache.org/jira/browse/LUCENE-1190 And here is the svn work in progress: https://admin.garambrogne.net/subversion/revuedepresse/trunk/src/java/lexicon And the web version:

Re: Lucene Search Performance

2008-02-26 Thread Jamie
Hi Michael, perhaps this will help. We are using Lucene to index emails and provide a search interface to search through those emails. Many of our customers have 3-5 TB or more of email data. The index size tends to be around 5 GB per million messages. On a 3 GHz Intel Core Duo with standard

RE: Rebuilding Document from index?

2008-02-26 Thread Itamar Syn-Hershko
Not to ruin your party, but I'm not sure exactly what this Lexicon object is for and how it should work. Plus, the requirements I have for analyzing Hebrew (not only for the MoreLikeThis functionality) are far more demanding than what is needed for French. But I'm open to any suggestion on this

Re: Lucene Search Performance

2008-02-26 Thread Michael Stoppelman
So you're saying searches are taking 10 seconds on a 5G index? If so, that seems ungodly slow. If you're on *nix, have you watched your iostat statistics? Maybe something is hammering your hds. Something seems amiss. Which Lucene methods were pointed to as hotspots by YourKit? -M On Tue, Feb 26,

Re: Inconsistent Search Speed

2008-02-26 Thread Grant Ingersoll
The first call loads various data structures into memory. The second takes advantage of those structures being in memory. What you want to do is warm the searcher by sending some queries to it before making it available. -Grant On Feb 26, 2008, at 3:49 PM, fangz wrote: Hi, I am
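Grant's advice to warm the searcher before making it available can be sketched roughly as follows. This is a minimal, library-independent sketch, not Lucene code: WarmedSearcherHolder, publish and the Function-based "searcher" are hypothetical names standing in for whatever object actually runs your queries.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

// Hypothetical holder, not a Lucene class: Q is the query type, R the result type.
final class WarmedSearcherHolder<Q, R> {
    private final AtomicReference<Function<Q, R>> live = new AtomicReference<>();

    // Run the warm-up queries against the fresh searcher, then make it visible.
    void publish(Function<Q, R> fresh, List<Q> warmupQueries) {
        for (Q query : warmupQueries) {
            fresh.apply(query); // result discarded; the call only pays the first-use cost
        }
        live.set(fresh);
    }

    // Regular searches only ever see a searcher that has already been warmed.
    R search(Q query) {
        Function<Q, R> searcher = live.get();
        if (searcher == null) {
            throw new IllegalStateException("no searcher has been published yet");
        }
        return searcher.apply(query);
    }
}
```

The point of the design is ordering: the slow first queries are spent on warm-up traffic before the reference is swapped, so user-facing searches should not hit a cold searcher.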

Re: regex expressions within phrase queries

2008-02-26 Thread Chris Hostetter
: Thanks for the advice Chris. What I am working on now is extracting the : matching phrases. The current code for MultiPhraseQuery and SpanQueries : just return all matching terms, not matching phrases. I implemented some : code matching up the TermPositions, but this is pretty slow. Is

Re: Lucene Search Performance

2008-02-26 Thread h t
Hi Michael, I guess the hotspot in Lucene is org.apache.lucene.search.IndexSearcher.search() Hi Jamie, What's the original text size of a million emails? I estimate the size of an email is around 100k, is this true? When you search, what kind of keywords do you enter: words or short sentences?

Re: Inconsistent Search Speed

2008-02-26 Thread h t
Did you use the same keywords in both calls? 2008/2/27, fangz [EMAIL PROTECTED]: Hi, I am using a simple Java program to test the search speed. The index file is about 1.93G in size. I created an IndexSearcher and built a query using the query parser: parser.parse("entity:fail"). The initial

Re: Inconsistent Search Speed

2008-02-26 Thread Mark Miller
The Lucene prime directive: don't iterate through all of Hits! It's horribly inefficient. You must use a HitCollector. Even then, retrieving field values will be slow if you do it for every hit, so you don't want to do that for every hit in a search. But don't loop through Hits.

Re: Lucene Search Performance

2008-02-26 Thread Anshum
Hi Jamie, Are you running concurrent searches on the index, i.e. spawning multiple threads and not managing them? I have been having similar issues, and I am planning to try out a workaround for it using Java's Executor interface.
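One way the Executor-based workaround mentioned above might look is a fixed-size thread pool that caps how many searches run at once. This is only a sketch under assumptions: BoundedSearchRunner and runAll are hypothetical names, and each Callable stands in for a single search against the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: runs search tasks on a pool with a fixed upper bound of threads.
final class BoundedSearchRunner {
    // Returns the results in the same order as the submitted tasks.
    static <R> List<R> runAll(List<Callable<R>> searches, int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            List<R> results = new ArrayList<>(searches.size());
            // invokeAll blocks until every task has completed or failed.
            for (Future<R> future : pool.invokeAll(searches)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            throw new IllegalStateException("interrupted while waiting for searches", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a search task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

In a long-running application the pool would normally be created once and shared rather than built per call; it is created inside runAll here only to keep the sketch self-contained.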

Re: How to find most popular terms quickly?

2008-02-26 Thread Chris Hostetter
: 1) open IndexReader. : 2) call terms() to get all terms, then filter out terms in field Brand. : 3) call termDocs(Term) to get Docs having each special Brand. : 4) count which term is used by most docs from above result. : : Is this the most efficient way? pretty much ... take a look at the
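The counting in steps 2 to 4 of the quoted question can be illustrated without an index at all. The sketch below is an assumption-laden model, not Lucene code: PopularTermSketch and mostPopular are hypothetical names, and the map plays the role of the Brand field's terms with the documents that contain each one. Against a real index the per-term count would come from the index itself (for example, the document frequency that Lucene's TermEnum reports) rather than from an in-memory list.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of "find the term used by the most documents".
final class PopularTermSketch {
    // postings maps each term of one field to the ids of the documents containing it.
    // On a tie, the term encountered first in the map's iteration order wins.
    static Optional<String> mostPopular(Map<String, List<Integer>> postings) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            int docCount = entry.getValue().size(); // number of docs using this term
            if (docCount > bestCount) {
                best = entry.getKey();
                bestCount = docCount;
            }
        }
        return Optional.ofNullable(best); // empty when the field has no terms
    }
}
```

The loop makes a single pass over the terms and keeps only the running maximum, so memory use stays constant regardless of how many distinct brands exist.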

Re: Security filtering from external DB

2008-02-26 Thread h t
I guess you can implement createBitSet() more efficiently by using a Filter rather than a BooleanQuery. 2008/2/25, Gabriel Landais [EMAIL PROTECTED]: Gabriel Landais wrote: How to create a Filter for a field in CollectionString? First, split Collection in CollectionCollection with