Re: labels search like google mail?

2006-11-23 Thread Chris Lamprecht
Sure.. using Lucene you could have a field called labels (or tags, as everyone except Google calls them) and just add a bunch of keyword field values to that field, one for each tag. The tricky part might be doing this quickly -- updating the Lucene index right when the user adds a tag -- if
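A minimal sketch of that approach, assuming the Lucene 1.4-era Field.Keyword()/Field.Text() factories and made-up field names:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;

    public class LabelExample {
        // One document, several labels: add the same field name once per tag.
        public static Document buildMessage(String body, String[] labels) {
            Document doc = new Document();
            doc.add(Field.Text("body", body));               // tokenized, searchable
            for (int i = 0; i < labels.length; i++) {
                doc.add(Field.Keyword("labels", labels[i])); // untokenized keyword value; fields can repeat
            }
            return doc;
        }

        // Finding everything labeled "travel" is then a plain TermQuery.
        public static TermQuery byLabel(String label) {
            return new TermQuery(new Term("labels", label));
        }
    }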

Re: MultiFieldQueryParser Search On C++ problem

2006-06-16 Thread Chris Lamprecht
It's the Analyzer you're passing into the QueryParser. StandardAnalyzer turns C++ into c. You can change the .jj grammar to fix this. (same for C#) On 6/14/06, Joe Amstadt [EMAIL PROTECTED] wrote: I'm trying to do a search on ( Java PHP C++ ) with lucene 1.9. I am using a
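For illustration, a sketch that swaps in WhitespaceAnalyzer (which leaves C++ alone) when building the parser; the field name and query are made up, and the same analyzer would also have to be used at indexing time:

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class ParseCpp {
        public static Query parse(String userQuery) throws ParseException {
            // WhitespaceAnalyzer only splits on whitespace, so C++ is not reduced to c
            QueryParser parser = new QueryParser("contents", new WhitespaceAnalyzer());
            return parser.parse(userQuery);
        }
    }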

Re: What is the retrieval model for lucene?

2006-04-11 Thread Chris Lamprecht
It uses a combination of the boolean model, to get the set of matching documents, and the vector space model (by default) to rank them. Or one might say it uses the vector space model and only returns nonzero-scoring documents. On 4/10/06, hu andy [EMAIL PROTECTED] wrote: I have seen in some documents that there

Re: Distributed Lucene.. - clustering as a requirement

2006-04-06 Thread Chris Lamprecht
What about using lucene just for searching (i.e., no stored fields except maybe one ID primary key field), and using an RDBMS for storing the actual documents? This way you're using lucene for what lucene is best at, and using the database for what it's good at. At least up to a point -- RDBMSs
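A rough sketch of that split, using the Lucene 1.4-era Field factories; field names are made up, and the RDBMS side is only hinted at in comments:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class SlimDocument {
        // Store only the primary key; index the body without storing it,
        // so the database stays the system of record for document content.
        public static Document build(long id, String body) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", String.valueOf(id))); // stored, untokenized
            doc.add(Field.UnStored("body", body));            // indexed, not stored
            return doc;
        }
        // At search time: read hits.doc(i).get("id"), then fetch the row
        // (SELECT ... WHERE id = ?) from the database for display.
    }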

Re: Throughput doesn't increase when using more concurrent threads

2006-03-10 Thread Chris Lamprecht
Peter, I think this is similar to the patch in this bugzilla task: http://issues.apache.org/bugzilla/show_bug.cgi?id=35838 the patch itself is http://issues.apache.org/bugzilla/attachment.cgi?id=15757 (BTW does JIRA have a way to display the patch diffs?) The above patch also has a change to

Re: Help: tweaking search - reducing IDF skew and implementing score cutoff

2006-02-10 Thread Chris Lamprecht
2. If I choose to sort the results by date, then recent documents with very very low relevancy (say the words searched appears only in content, and not in title/bylines/summary fields that are boosted higher) are still shown relatively high in the list, and I wish to omit them in general.
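One way to implement such a cutoff is a custom HitCollector (a sketch against the Lucene 1.4-era API; the threshold is arbitrary, and the raw scores seen here are not the normalized Hits scores):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.search.HitCollector;

    // Keeps only documents whose raw score clears a minimum; the surviving
    // doc ids can then be sorted by date outside of Lucene.
    public class CutoffCollector extends HitCollector {
        private final float minScore;
        private final List docIds = new ArrayList();

        public CutoffCollector(float minScore) {
            this.minScore = minScore;
        }

        public void collect(int doc, float score) {
            if (score >= minScore) {
                docIds.add(new Integer(doc));
            }
        }

        public List getDocIds() {
            return docIds;
        }
    }
    // usage: searcher.search(query, new CutoffCollector(0.2f));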

Re: Distributed vs Merged Searching

2006-02-01 Thread Chris Lamprecht
One issue is that if you are splitting the index in half (for example), getting some results from index A and some from index B, then you need to merge the results somewhere. But the scores coming from the two indexes are not related at all, for example, document 100 from index A has score 0.85,
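If both halves are reachable from one JVM, MultiSearcher will do the merging into a single ranked Hits list; a minimal sketch with hypothetical index paths:

    import java.io.IOException;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    public class TwoIndexSearch {
        public static Hits search(Query query) throws IOException {
            Searchable[] halves = new Searchable[] {
                new IndexSearcher("/indexes/partA"),   // hypothetical paths
                new IndexSearcher("/indexes/partB")
            };
            // merges hits from both sub-indexes into one result list
            return new MultiSearcher(halves).search(query);
        }
    }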

Re: How does the lucene normalize the score?

2006-01-27 Thread Chris Lamprecht
It takes the highest-scoring document and, if that score is greater than 1.0, divides every hit's score by it, leaving them all <= 1.0. Actually, I just looked at the code, and it does this by taking 1/maxScore and then multiplying this by each score (equivalent results in the end, maybe more

Re: Performance tips?

2006-01-26 Thread Chris Lamprecht
I seem to say this a lot :), but, assuming your OS has a decent filesystem cache, try reducing your JVM heap size and using an FSDirectory instead of a RAMDirectory, and see if your filesystem cache does OK. If you have 12GB, then you should have enough RAM to hold both the old and new indexes during
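A sketch of that setup; the index path is made up, and the small heap is set on the command line rather than in code:

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class DiskBackedSearch {
        // Open the index straight off disk and let the OS filesystem cache,
        // not the JVM heap, keep the hot parts of the index in memory.
        public static IndexSearcher open(String indexPath) throws IOException {
            return new IndexSearcher(FSDirectory.getDirectory(indexPath, false));
        }
        // run with something like: java -Xmx256m ... instead of a multi-gigabyte heap
    }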

Re: TermFreqVector

2005-11-17 Thread Chris Lamprecht
Can you post the code you're using to create the Document and add it to the IndexWriter? You have to tell Lucene to store term freq vectors (it isn't done by default). Also, I'm not sure what you mean when you say your documents do not have fields. Do you have at least one field? -chris On
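For reference, a sketch of storing and reading a term vector with the Lucene 1.4-era API; the path and field name are made up, and the important part is the extra storeTermVector flag at indexing time:

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.TermFreqVector;

    public class TermVectorExample {
        public static void main(String[] args) throws IOException {
            IndexWriter writer = new IndexWriter("/tmp/tvindex", new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Text("contents", "the quick brown fox", true)); // true = store term vector
            writer.addDocument(doc);
            writer.close();

            IndexReader reader = IndexReader.open("/tmp/tvindex");
            TermFreqVector tfv = reader.getTermFreqVector(0, "contents");
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                System.out.println(terms[i] + " -> " + freqs[i]);
            }
            reader.close();
        }
    }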

Re: Optimize vs non optimized index

2005-11-16 Thread Chris Lamprecht
Are you using the compound index format (do you have .cfs files)? I think using the non-compound format might take less space (2.5G less in your case) when optimizing, since it doesn't have to do that last step of copying all the index files into the .cfs file. Also Lucene 1.9 (available from
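For illustration, switching off the compound format before optimizing, assuming the IndexWriter setter available in that era:

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class NonCompoundOptimize {
        public static void optimize(String indexPath) throws IOException {
            IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
            writer.setUseCompoundFile(false); // keep separate index files; skip the final .cfs copy
            writer.optimize();
            writer.close();
        }
    }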

Re: Question about scoring normalisation

2005-11-05 Thread Chris Lamprecht
Lucene just takes the highest score returned, and divides all scores by this max_score. So max_score / max_score = 1.0, and voila. On 11/5/05, Karl Koch [EMAIL PROTECTED] wrote: Hello all, I am wondering how many of you actually work with your own scoring mechanism (overriding Lucene's standard

Re: searching on special characters as in C++

2005-10-06 Thread Chris Lamprecht
StandardAnalyzer's grammar tokenizes C# and C++ down to C. So you can either use an analyzer that tokenizes differently (such as WhitespaceAnalyzer), or modify the JavaCC grammar for StandardAnalyzer and rebuild your own custom version. If you go the latter route, have a look at NutchAnalysis.jj

Re: IO bandwidth throttling

2005-09-01 Thread Chris Lamprecht
trying to prevent a cron task that updates the lucene search index from consuming the disk, causing the search to slow down. -chris On 9/1/05, Ben Gollmer [EMAIL PROTECTED] wrote: Chris Lamprecht wrote: I've wanted something similar, for the same purpose -- to keep lucene from consuming disk

Re: Ideal Index Fragmentation

2005-08-30 Thread Chris Lamprecht
Zach, It probably won't help performance to split the index and then search it on the same machine unless you search the indexes in parallel (with a multiprocessor or multi-core machine). Even in this case, the disk is often a bottleneck, essentially preventing the search from really running in
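If you do split the index and have CPUs to spare, ParallelMultiSearcher searches each piece in its own thread; a short sketch with hypothetical paths:

    import java.io.IOException;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ParallelMultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    public class ParallelSearch {
        public static Hits search(Query query) throws IOException {
            Searchable[] pieces = new Searchable[] {
                new IndexSearcher("/indexes/piece1"),
                new IndexSearcher("/indexes/piece2")
            };
            // one thread per sub-index; results are merged into a single Hits
            return new ParallelMultiSearcher(pieces).search(query);
        }
    }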

Re: QueryParser not thread-safe

2005-08-24 Thread Chris Lamprecht
I would just create a new QueryParser for each query. Allocating short-lived objects is just about free in java, and the time spent performing the actual search will by far dominate any time spent constructing QueryParser objects. On 8/24/05, Vanlerberghe, Luc [EMAIL PROTECTED] wrote: Thanks
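A minimal sketch of the per-request pattern (field name and analyzer are arbitrary):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class PerRequestParsing {
        // A fresh, short-lived QueryParser for every query string;
        // nothing here is shared between threads.
        public static Query parse(String userQuery) throws ParseException {
            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
            return parser.parse(userQuery);
        }
    }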

Re: Indexing terms limit

2005-08-10 Thread Chris Lamprecht
See IndexWriter.setMaxFieldLength(), I think it's what you want: from javadocs: public void setMaxFieldLength(int maxFieldLength) The maximum number of terms that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with
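For illustration, raising the limit on a writer before adding large documents (the new value is arbitrary; the default is 10,000 terms per field):

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class BigFieldIndexing {
        public static IndexWriter openWriter(String path) throws IOException {
            IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
            // raise the per-field term limit so very large documents are not silently truncated
            writer.setMaxFieldLength(Integer.MAX_VALUE);
            return writer;
        }
    }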

Re: IO bandwidth throttling

2005-08-02 Thread Chris Lamprecht
I've wanted something similar, for the same purpose -- to keep lucene from consuming disk I/O resources when another process is running on the same machine. A general solution might be to define a simple interface such as interface IndexInputOutputListener { void willReadBytes(int
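A hedged guess at how that sketch might continue; none of this exists in Lucene, and the names and throttling policy are entirely hypothetical:

    // Hypothetical callback interface (not part of Lucene) that a custom
    // Directory/stream wrapper could invoke around each low-level read and write,
    // giving the application a hook to sleep and cap I/O bandwidth.
    interface IndexInputOutputListener {
        void willReadBytes(int numBytes);
        void didReadBytes(int numBytes);
        void willWriteBytes(int numBytes);
        void didWriteBytes(int numBytes);
    }

    // One possible listener: sleep whenever the bytes moved in the current
    // one-second window exceed a configured budget.
    class ThrottlingListener implements IndexInputOutputListener {
        private final long maxBytesPerSecond;
        private long windowStart = System.currentTimeMillis();
        private long bytesInWindow = 0;

        ThrottlingListener(long maxBytesPerSecond) {
            this.maxBytesPerSecond = maxBytesPerSecond;
        }

        private synchronized void throttle(int numBytes) {
            long now = System.currentTimeMillis();
            if (now - windowStart >= 1000) {   // start a new one-second window
                windowStart = now;
                bytesInWindow = 0;
            }
            bytesInWindow += numBytes;
            if (bytesInWindow > maxBytesPerSecond) {
                try {
                    Thread.sleep(1000 - (now - windowStart)); // wait out the rest of the window
                } catch (InterruptedException ignored) { }
            }
        }

        public void willReadBytes(int numBytes)  { throttle(numBytes); }
        public void didReadBytes(int numBytes)   { }
        public void willWriteBytes(int numBytes) { throttle(numBytes); }
        public void didWriteBytes(int numBytes)  { }
    }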

Re: Hardware Question

2005-07-27 Thread Chris Lamprecht
It depends on your usage. When you search, does your code also retrieve the docs (using Searcher.document(n), for instance)? If your index is 8GB, part of that is the indexed part (searchable), and part is just stored document fields. It may be as simple as adding more RAM (try 4, 6, and 8GB)

Re: Loading large index into RAM

2005-07-08 Thread Chris Lamprecht
If you're on an x86_64 machine (an AMD Opteron, for instance), you may be able to set your JVM heap this large. But if you have 6GB of RAM, you might try keeping your JVM small (under 1GB) and letting Linux's filesystem cache do the work. Lucene searches are often CPU-bound (during the search

Re: Ideas Needed - Finding Duplicate Documents

2005-06-12 Thread Chris Lamprecht
I'd have to see your indexing code to see if there are any obvious performance gotchas there. If you can run your indexer under a profiler (OptimizeIt, JProbe, or just the free one with java using -Xprof), it will tell you in which methods most of your CPU time is spent. If you're using

Re: Lucene 1.4.3 QueryParser cannot parse great! ?

2005-06-09 Thread Chris Lamprecht
See QueryParser.escape(), it automatically escapes these special characters for you. On 6/9/05, Zhang, Lisheng [EMAIL PROTECTED] wrote: Hi Richard, Thanks very much! That works. Lisheng -Original Message- From: Richard Krenek [mailto:[EMAIL PROTECTED] Sent: Thursday, June 09,
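A small illustration of the escape-then-parse pattern; the field name and analyzer are arbitrary:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class EscapeExample {
        public static Query parseLiteral(String rawText) throws ParseException {
            // escape() backslash-escapes characters that are meaningful in the query syntax
            String safe = QueryParser.escape(rawText);
            return new QueryParser("contents", new StandardAnalyzer()).parse(safe);
        }
    }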

Re: Finding minimum and maximum value of a field?

2005-05-31 Thread Chris Lamprecht
Lucene rewrites RangeQueries into a BooleanQuery containing a bunch of OR'd terms. If you have too many terms (dates in your case), you will run into a TooManyClauses exception. I think the default is about 1024; you can set it with BooleanQuery.setMaxClauseCount(). On 5/31/05, Kevin Burton
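For illustration, raising the clause limit before running a wide range query (the new limit is arbitrary, and very wide ranges still cost memory and CPU):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.RangeQuery;

    public class WideRange {
        public static RangeQuery dateRange(String from, String to) {
            // default limit is 1024 expanded clauses; raise it before the query runs
            BooleanQuery.setMaxClauseCount(10000);
            return new RangeQuery(new Term("date", from), new Term("date", to), true);
        }
    }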

Re: Query.toString() does not escape special characters

2005-05-24 Thread Chris Lamprecht
Hi Peter, See the method escape(String s) of QueryParser, it may do what you want. On 5/24/05, Peter Gelderbloem [EMAIL PROTECTED] wrote: Hi, I am building queries using the query api and when I use } in my fieldname and then call toString on the query, QueryParser throws a ParseException

Re: QueryParser and Special Characters

2005-05-09 Thread Chris Lamprecht
You might need a double backslash, since the string \(1\+1\) represented in Java is \\(1\\+1\\) (see the javadocs for java.util.regex.Pattern for a better explanation). On 5/9/05, Kipping, Peter [EMAIL PROTECTED] wrote: The documentation tells us to escape special characters by using the \
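In other words, written as a Java string literal:

    // the query syntax needs \(1\+1\), so every backslash is doubled in Java source
    String escaped = "\\(1\\+1\\)";   // the String actually passed to the parser is \(1\+1\)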

Re: Lucene bulk indexing

2005-04-19 Thread Chris Lamprecht
Muffadal, First, you should add some timing code to determine whether your database is slow, or your indexing (I think tokenization occurs in the call to writer.addDocument()). Assuming your database query is the slowdown, read on... Depending on the details of your database (which fields are
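For illustration, a minimal way to time the indexing half on its own (the caller is assumed to have already fetched the rows and built the Documents; timing the database query works the same way):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class IndexTiming {
        public static void timeIndexing(IndexWriter writer, Document[] docs) throws IOException {
            long start = System.currentTimeMillis();
            for (int i = 0; i < docs.length; i++) {
                writer.addDocument(docs[i]);   // analysis/tokenization happens inside this call
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("indexed " + docs.length + " docs in " + elapsed + " ms");
        }
    }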

RangeQuery doesn't override equals() or hashCode() - intentional?

2005-04-11 Thread Chris Lamprecht
I was attempting to cache QueryFilters in a Map using the Query as the key (a BooleanQuery instance containing two RangeQueries), and I discovered that my BooleanQueries' equals() methods would always return false, even when the queries were equivalent. The culprit was RangeQuery - it doesn't
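The caching pattern being attempted, roughly (a sketch; it only pays off once the Query classes involved implement equals() and hashCode() consistently):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryFilter;

    public class FilterCache {
        private final Map cache = new HashMap();  // Query -> QueryFilter

        // Reuse one QueryFilter per distinct Query so its internal per-reader
        // bit cache is not thrown away on every request.
        public synchronized Filter filterFor(Query query) {
            Filter f = (Filter) cache.get(query);
            if (f == null) {
                f = new QueryFilter(query);
                cache.put(query, f);
            }
            return f;
        }
    }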

Re: batch delete

2005-03-28 Thread Chris Lamprecht
I have to query for a list of old documents (from a given date) and delete each document individually? Can I use DateFilter.Before() with Term? Thanks, Ben On Mon, 28 Mar 2005 02:13:48 -0600, Chris Lamprecht [EMAIL PROTECTED] wrote: Ben, If you know the exact terms you want
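For context, a sketch of term-based deletion with the Lucene 1.4-era IndexReader API (the field name is hypothetical); deleting by a date range would still need a search to find the matching terms or documents first:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class DeleteByTerm {
        // Removes every document whose "id" field exactly matches the given value.
        public static int delete(String indexPath, String id) throws IOException {
            IndexReader reader = IndexReader.open(indexPath);
            int deleted = reader.delete(new Term("id", id)); // returns the number of docs deleted
            reader.close();
            return deleted;
        }
    }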