Sure. Using Lucene, you could have a field called labels (or tags, as
everyone except Google calls them), and just add a bunch of keyword field
values to the field, one for each tag. The tricky part might be doing this
quickly -- updating the lucene index right when the user adds a tag -- if
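The indexing side, at least, is straightforward. A rough sketch (untested; "text" and the tags array stand in for your own data):

    Document doc = new Document();
    doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
    // Lucene happily accepts multiple values for the same field name,
    // so just add one keyword value per tag
    for (int i = 0; i < tags.length; i++) {
        doc.add(new Field("labels", tags[i], Field.Store.YES, Field.Index.UN_TOKENIZED));
    }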
It's the Analyzer you're passing into the QueryParser.
StandardAnalyzer turns C++ into c. You can change the .jj grammar
to fix this. (The same goes for C#.)
On 6/14/06, Joe Amstadt [EMAIL PROTECTED] wrote:
I'm trying to do a search on ( Java PHP C++ ) with
lucene 1.9. I am using a
It uses a combination: the boolean model, to get the set of matching
documents, and the vector space model (by default) to rank them. Or one
might say it uses the vector space model and only returns nonzero-scoring
documents.
On 4/10/06, hu andy [EMAIL PROTECTED] wrote:
I have seen in some documents that there
What about using lucene just for searching (i.e., no stored fields
except maybe one ID primary key field), and using an RDBMS for
storing the actual documents? This way you're using lucene for what
lucene is best at, and using the database for what it's good at. At
least up to a point -- RDBMSs
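At indexing time that might look roughly like this (field names are made up; not compiled):

    Document doc = new Document();
    // store only the primary key; index the text but don't store it
    doc.add(new Field("id", String.valueOf(rowId), Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);

At search time you read the "id" field off each hit and SELECT the real document from the database.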
Peter,
I think this is similar to the patch in this bugzilla task:
http://issues.apache.org/bugzilla/show_bug.cgi?id=35838
the patch itself is http://issues.apache.org/bugzilla/attachment.cgi?id=15757
(BTW does JIRA have a way to display the patch diffs?)
The above patch also has a change to
2. If I choose to sort the results by date, then recent documents with
very low relevancy (say the searched words appear only in the content
field, and not in the title/byline/summary fields that are boosted
higher) are still shown relatively high in the list, and I wish to
omit them in general.
One issue is that if you are splitting the index in half (for
example), getting some results from index A and some from index B,
then you need to merge the results somewhere. But the scores coming
from the two indexes are not directly comparable; for example, document 100
from index A has score 0.85,
It takes the highest document score, and if it is greater than 1.0,
divides every hit's score by that number, leaving them all <= 1.0.
Actually, I just looked at the code, and it does this by
taking 1/maxScore and then multiplying each score by that (equivalent
results in the end, maybe more
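In pseudo-Java, the arithmetic is just this (maxScore and the raw scores standing in for what Hits keeps internally):

    float normFactor = (maxScore > 1.0f) ? (1.0f / maxScore) : 1.0f;
    for (int i = 0; i < scores.length; i++) {
        scores[i] *= normFactor;   // same result as scores[i] / maxScore
    }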
I seem to say this a lot :), but, assuming your OS has a decent
filesystem cache, try reducing your JVM heapsize, using an FSDirectory
instead of RAMDirectory, and see if your filesystem cache does ok. If
you have 12GB, then you should have enough RAM to hold both the old
and new indexes during
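That is, open the index straight off the filesystem and keep the heap small, something like (path is a placeholder):

    Directory dir = FSDirectory.getDirectory("/path/to/index", false);
    IndexSearcher searcher = new IndexSearcher(dir);

and run the JVM with, say, -Xmx256m, leaving the rest of the RAM to the OS filesystem cache.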
Can you post the code you're using to create the Document and add
it to the IndexWriter? You have to tell lucene to store term freq
vectors (it isn't done by default). Also, I'm not sure what you mean
when you say your documents do not have fields. Do you have at least
one field?
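With the 1.9-style Field constructor, turning on term vectors looks something like this (field name and text are placeholders):

    doc.add(new Field("contents", text,
                      Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));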
-chris
Are you using the compound index format (do you have .cfs files)? I
think using the non-compound format might take less space (2.5G less
in your case) when optimizing, since it doesn't have to do that last
step of copying all the index files into the .cfs file.
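That's just a setter on the writer (untested):

    IndexWriter writer = new IndexWriter(dir, analyzer, false);
    writer.setUseCompoundFile(false);   // write the multi-file (non-compound) format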
Also Lucene 1.9 (available from
Lucene just takes the highest score returned, and divides all scores
by this max_score. So max_score / max_score = 1.0, and voila.
On 11/5/05, Karl Koch [EMAIL PROTECTED] wrote:
Hello all,
I am wondering how many of you actually work with your own scoring
mechanisms (overriding Lucene's standard
StandardAnalyzer's grammar tokenizes both C# and C++ down to c. So you
can either use an analyzer that tokenizes differently (such as
WhitespaceAnalyzer), or modify the JavaCC grammar for StandardAnalyzer
and rebuild your own custom version. If you go the latter route, have
a look at NutchAnalysis.jj
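The first route is a one-liner; just be sure to use the same analyzer at index and query time (paths and field name made up):

    Analyzer analyzer = new WhitespaceAnalyzer();   // doesn't lowercase or strip punctuation
    IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
    // ... index your docs, then at query time:
    Query query = new QueryParser("contents", analyzer).parse("C++");

The tradeoff is that WhitespaceAnalyzer keeps everything verbatim, so "Java." and "Java" become different terms.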
trying to prevent a cron task that
updates the lucene search index from consuming all the disk I/O, causing
searches to slow down.
-chris
On 9/1/05, Ben Gollmer [EMAIL PROTECTED] wrote:
Chris Lamprecht wrote:
I've wanted something similar, for the same purpose -- to keep lucene
from consuming disk
Zach,
It probably won't help performance to split the index and then search
it on the same machine unless you search the indexes in parallel (with
a multiprocessor or multi-core machine). Even in this case, the disk
is often a bottleneck, essentially preventing the search from really
running in
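If you do go parallel on one box, ParallelMultiSearcher searches each index in its own thread -- roughly (paths made up; exception handling omitted):

    Searchable[] searchables = {
        new IndexSearcher("/indexes/a"),
        new IndexSearcher("/indexes/b")
    };
    Searcher searcher = new ParallelMultiSearcher(searchables);
    Hits hits = searcher.search(query);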
I would just create a new QueryParser for each query. Allocating
short-lived objects is just about free in java, and the time spent
performing the actual search will by far dominate any time spent
constructing QueryParser objects.
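In other words, just do this per request (field name is a placeholder, analyzer is whatever you index with):

    QueryParser parser = new QueryParser("contents", analyzer);
    Query query = parser.parse(userInput);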
On 8/24/05, Vanlerberghe, Luc [EMAIL PROTECTED] wrote:
Thanks
See IndexWriter.setMaxFieldLength(), I think it's what you want:
from javadocs:
public void setMaxFieldLength(int maxFieldLength)
The maximum number of terms that will be indexed for a single field in
a document. This limits the amount of memory required for indexing, so
that collections with
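So right after creating your IndexWriter, something like this (the limit here is arbitrary):

    writer.setMaxFieldLength(50000);   // default is 10,000 terms per field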
I've wanted something similar, for the same purpose -- to keep lucene
from consuming disk I/O resources when another process is running on
the same machine.
A general solution might be to define a simple interface such as

    interface IndexInputOutputListener {
        void willReadBytes(int numBytes);  // called before each read; a write-side hook would look similar
    }
It depends on your usage. When you search, does your code also
retrieve the docs (using Searcher.document(n), for instance)? If your
index is 8GB, part of that is the indexed part (searchable), and
part is just stored document fields.
It may be as simple as adding more RAM (try 4, 6, and 8GB)
If you're on an x86_64 machine (an AMD Opteron, for instance), you may
be able to set your JVM heap this large. But if you have 6GB RAM, you
might try keeping your JVM small (under 1GB), and letting linux's
filesystem cache do the work. Lucene searches are often CPU-bound
(during the search
I'd have to see your indexing code to see if there are any obvious
performance gotchas there. If you can run your indexer under a
profiler (OptimizeIt, JProbe, or just the free one with java using
-Xprof), it will tell you in which methods most of your CPU time is
spent. If you're using
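The built-in one is just a JVM flag (class name made up):

    java -Xprof YourIndexer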
See QueryParser.escape(), it automatically escapes these special
characters for you.
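Something like this (field name/analyzer are whatever you already use):

    String escaped = QueryParser.escape(userInput);   // e.g. (1+1):2 becomes \(1\+1\)\:2
    Query query = new QueryParser("contents", analyzer).parse(escaped);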
On 6/9/05, Zhang, Lisheng [EMAIL PROTECTED] wrote:
Hi Richard,
Thanks very much! That works.
Lisheng
-----Original Message-----
From: Richard Krenek [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 09,
Lucene rewrites a RangeQuery into a BooleanQuery containing a bunch of
OR'd terms. If you have too many terms (dates, in your case), you will
run into a TooManyClauses exception. The default limit is
1024; you can raise it with BooleanQuery.setMaxClauseCount().
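It's a static setting, e.g.:

    BooleanQuery.setMaxClauseCount(4096);   // call before running the query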
On 5/31/05, Kevin Burton
Hi Peter,
See the method escape(String s) of QueryParser, it may do what you want.
On 5/24/05, Peter Gelderbloem [EMAIL PROTECTED] wrote:
Hi,
I am building queries using the query api and when I use } in my fieldname
and then call toString on the query, QueryParser throws a ParseException
You might need a double backslash, since the string \(1\+1\)
written as a Java string literal is \\(1\\+1\\) (see the javadocs for
java.util.regex.Pattern for a better explanation).
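That is:

    // the query string \(1\+1\) as a Java string literal:
    String query = "\\(1\\+1\\)";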
On 5/9/05, Kipping, Peter [EMAIL PROTECTED] wrote:
The documentation tells us to escape special characters by using the \
Muffadal,
First, you should add some timing code to determine whether your
database is slow, or your indexing (I think tokenization occurs in the
call to writer.addDocument()). Assuming your database query is the
slowdown, read on...
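Something as crude as this will tell you (writer and doc being your existing objects):

    long t0 = System.currentTimeMillis();
    // ... fetch the row from the database and build doc ...
    long t1 = System.currentTimeMillis();
    writer.addDocument(doc);   // tokenizing happens in here
    long t2 = System.currentTimeMillis();
    System.out.println("db: " + (t1 - t0) + " ms, index: " + (t2 - t1) + " ms");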
Depending on the details of your database (which fields are
I was attempting to cache QueryFilters in a Map using the Query as the
key (a BooleanQuery instance containing two RangeQueries), and I
discovered that my BooleanQueries' equals() methods would always
return false, even when the queries were equivalent. The culprit was
RangeQuery -- it doesn't override equals()
Do I have to query for a list of old documents (from a given date)
and delete each document individually?
Can I use DateFilter.Before() with a Term?
Thanks,
Ben
On Mon, 28 Mar 2005 02:13:48 -0600, Chris Lamprecht
[EMAIL PROTECTED] wrote:
Ben,
If you know the exact terms you want
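For example, deleting by an exact term goes through IndexReader (the field name and date format here are made up):

    IndexReader reader = IndexReader.open("/path/to/index");
    int deleted = reader.delete(new Term("date", "20050101"));  // removes every doc containing this term
    reader.close();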