Kevin A. Burton wrote:
During an optimize I assume Lucene starts writing to a new segment and
leaves all others in place until everything is done and THEN deletes them?
That's correct.
The only settings I use are:
targetIndex.mergeFactor=10;
targetIndex.minMergeDocs=1000;
the resulting index has
John Wang wrote:
Just for my education, can you maybe elaborate on using the
"implement an IndexReader that delivers a
synthetic index" approach?
IndexReader is an abstract class. It has few data fields, and few
non-static methods that are not implemented in terms of abstract
methods. So, in ef
Kevin A. Burton wrote:
This is why I think it makes more sense to use our own java.io.tmpdir to
be on the safe side.
I think the bug is that Tomcat changes java.io.tmpdir. I thought that
the point of the system property java.io.tmpdir was to have a portable
name for /tmp on unix, c:\windows\tmp
Kevin A. Burton wrote:
No... I changed the mergeFactor back to 10 as you suggested.
Then I am confused about why it should take so long.
Did you by chance set the IndexWriter.infoStream to something, so that
it logs merges? If so, it would be interesting to see that output,
especially the last e
MATL (Mats Lindberg) wrote:
When i copied the lucene jar file to the solaris machine from the
windows machine i used a ftp program.
FTP probably mangled the file. You need to use FTP's binary mode.
Doug
Kevin A. Burton wrote:
So is it possible to fix this index now? Can I just delete the most
recent segment that was created? I can find this by ls -alt
Sorry, I forgot to answer your question: this should work fine. I don't
think you should even have to delete that segment.
Also, to elaborate
Kevin A. Burton wrote:
Also... what can I do to speed up this optimize? Ideally it wouldn't
take 6 hours.
Was this the index with the mergeFactor of 5000? If so, that's why it's
so slow: you've delayed all of the work until the end. Indexing on a
ramfs will make things faster in general, howe
John Wang wrote:
The solution you proposed is still a derivative of creating a
dummy document stream. Taking the same example, java (5), lucene (6),
VectorTokenStream would create a total of 11 Tokens whereas only 2 are
necessary.
That's easy to fix. We just need to reuse the token:
public cl
John Wang wrote:
While lucene tokenizes the words in the document, it counts the
frequency and figures out the position, we are trying to bypass this
stage: For each document, I have a set of words with a known frequency,
e.g. java (5), lucene (6) etc. (I don't care about the position, so it
ca
Julien,
Thanks for the excellent explanation.
I think this thread points to a documentation problem. We should
improve the javadoc for these parameters to make it easier for folks to
In particular, the javadoc for mergeFactor should mention that very
large values (>100) are not recommended, sin
A mergeFactor of 5000 is a bad idea. If you want to index faster, try
increasing minMergeDocs instead. If you have lots of memory this can
probably be 5000 or higher.
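In Lucene 1.4, mergeFactor and minMergeDocs are public fields on IndexWriter, so this advice comes down to something like the following sketch (the directory path and analyzer are placeholders; this will not run without Lucene on the classpath):

```java
// Keep mergeFactor at its default and raise minMergeDocs instead.
IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
writer.mergeFactor = 10;      // the default; huge values just defer all merge work
writer.minMergeDocs = 5000;   // buffer more documents in RAM before a segment is flushed
```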
Also, why do you optimize before you're done? That only slows things.
Perhaps you have to do it because you've set mergeFacto
> What do your queries look like? The memory required
> for a query can be computed by the following equation:
>
> 1 Byte * Number of fields in your query * Number of
> docs in your index
>
> So if your query searches on all 50 fields of your 3.5
> Million document index then each search would tak
> The best example that I've been able to find is the Yahoo research
> lab - as I understand it, this is a Nutch (i.e. Lucene)
> implementation that's providing impressive performance over a
> 100 million document repository.
This demo runs on a handful of boxes. It was originally running on
thre
Erik Hatcher wrote:
If you want something that does "quick fox*" where "quick" must be
followed by something starting with "fox", you'll have to do this
through the API, perhaps using the awkwardly named PhrasePrefixQuery,
which does support slop also. It would be up to you to do the term
expa
Otis Gospodnetic wrote:
Can anyone comment on performance differences?
I'd expect multi-threaded performance to be a bit worse with the
compound format, but single-threaded performance should be nearly identical.
Doug
David Spencer wrote:
Does it ever make sense to set the Similarity object in either (only one
of..) IndexWriter or IndexSearcher? i.e. If I set it in IndexWriter can
I avoid setting it in IndexSearcher? Also, can I avoid setting it in
IndexWriter and only set it in IndexSearcher? I noticed Nutch s
Jayant Kumar wrote:
Thanks for the patch. It helped in increasing the
search speed to a good extent.
Good. I'll commit it. Thanks for testing it.
But when we tried to
give about 100 queries in 10 seconds, then again we
found that after about 15 seconds, the response time
per query increased.
This
Doug Cutting wrote:
Please tell me if you are able to simplify your queries and if that
speeds things. I'll look into a ThreadLocal-based solution too.
I've attached a patch that should help with the thread contention,
although I've not tested it extensively.
I still don't
Jayant Kumar wrote:
Please find enclosed jvmdump.txt which contains a dump
of our search program after about 20 seconds of
starting the program.
Also enclosed is the file queries.txt which contains
few sample search queries.
Thanks for the data. This is exactly what I was looking for.
"Thread-14"
Jayant Kumar wrote:
We recently tested lucene with an index size of 2 GB
which has about 1,500,000 documents, each document
having about 25 fields. The frequency of search was
about 20 queries per second. This resulted in an
average response time of approximately 20 seconds
per search.
That sounds s
requirements for a search. Does this memory
get used only during the search operation itself,
or is it referenced by the Hits object or anything
else after the actual search completes?
Thanks again,
Jim
--- Doug Cutting <[EMAIL PROTECTED]> wrote:
James Dunn wrote:
Also I search across ab
James Dunn wrote:
Also I search across about 50 fields but I don't use
wildcard or range queries.
Lucene uses one byte of RAM per document per searched field, to hold the
normalization values. So if you search a 10M document collection with
50 fields, then you'll end up using 500MB of RAM.
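That rule of thumb is easy to sanity-check with a small hypothetical helper (the class and method names here are illustrative, not part of Lucene):

```java
// One byte of norm data in RAM per document per searched field (Lucene's rule).
class NormMemoryEstimate {
    static long normBytes(long numDocs, int numSearchedFields) {
        return numDocs * (long) numSearchedFields;
    }

    public static void main(String[] args) {
        // 10 million documents, 50 searched fields:
        System.out.println(normBytes(10_000_000L, 50)); // 500000000 bytes, i.e. 500 MB
    }
}
```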
If
hui wrote:
I am getting exactly the same score, like 0.04809519, for different-sized
documents for some queries, and this happens quite frequently. Based on the
score formula, it seems this should rarely happen. Or do I misunderstand the
formula?
Normalization factors (& document boosts) are represented
Leonid Portnoy wrote:
Am I misunderstanding something here, or is the documentation unclear?
The documentation is unclear. Can you propose an improvement?
Doug
Version 1.4 RC3 of Lucene is available for download from:
http://cvs.apache.org/dist/jakarta/lucene/v1.4-rc3/
Changes are described at:
http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.85
Doug
code. ( see test code )
2.) The first search is always really slow as everything initializes and
the cache fills ;) so don't let that discourage you.
-vito
On Mon, 2004-04-26 at 14:59, Doug Cutting wrote:
Anthony Vito wrote:
I noticed some talk on SQLDirectory a month or so ago.
Di
Matthew W. Bilotti wrote:
We suspect the coordination term is driving down
these documents' ranks and we would like to bring those documents back up
to where they should be.
That sounds right to me.
Is there a relatively easy way to implement what we want using Lucene?
Would it be better to t
Please don't crosspost to lucene-user and lucene-dev!
Tate Avery wrote:
3) The maxClauseCount threshold appears not to care whether or not my
clauses are 'required' or 'prohibited'... only how many of them there are in
total.
That's correct. It is an attempt to stop out-of-memory errors which can
Incze Lajos wrote:
Could anybody summarize what would be the technical pros/cons of a DB-based
directory over the flat files? (What I see at the moment is that for some
- significant? - performance penalty you'll get an index available over the
network for multiple Lucene engines -- if I'm right.)
h
Ioan Miftode wrote:
I recently upgraded to lucene 1.4 RC2 because I needed some
sorting capabilities. However some phrase searches don't
work anymore (the hits don't even have the terms I'm searching on).
Try the latest CVS. There were some bugs in 1.4RC2 that have been fixed.
(We'll probably do
Yukun Song wrote:
As known, currently Lucene uses flat file to store information for
indexing.
Does anyone have ideas or resources for combining a database (like MySQL or
PostgreSQL) with Lucene instead of the current flat index file formats?
A few folks have implemented an SQL-based Lucene Directory, but n
Anthony Vito wrote:
I noticed some talk on SQLDirectory a month or so ago. ( I just joined
the list :) ) I have a JDBC implementation that stores the "files" in a
couple of tables and stores the data for the files as blocks (BLOBs) of
a certain size ( 16k by default ). It also has an LRU cache fo
Win32 seems to sometimes not permit one to delete a file immediately
after it has been closed. Because of this, Lucene keeps a list of files
that need to be deleted in the 'deletable' file. Are your files listed
in this file? If so, Lucene will again try to delete these files the
next time
Francesco Bellomi wrote:
The only problem is that, as of Lucene 1.4rc2, FSDirectory is 'final'.
Please submit a patch to lucene-dev to make FSDirectory non-final.
In fact, a third architectural approach would be to define an API for
"pluggable" lock implementations: IMHO that would be more robust to
Francesco Bellomi wrote:
we are experiencing some difficulties in using Lucene with a NFS filesystem.
Basically, locking seems not to work properly: it appears that
attempted concurrent writes to the index (from different VMs) are not
blocked, and this often causes the index to be corrupted.
Weir, Michael wrote:
So if our server is the only process that ever opens the index, I should be
able to run through the indexes at startup and simply unlock them?
Yes.
Doug
Weir, Michael wrote:
I assume that it is possible to corrupt an index by crashing at just the right
time.
It should not be possible to corrupt an index this way.
I notice that there's a method IndexReader.unlock(). Does this method
ensure that the index has not been corrupted?
If you use this met
Magnus Mellin wrote:
i would like to partition an index over X number of remote searchers.
Any ideas, or suggestions, on how to use the same term dictionary (one
that represents the terms and frequencies for the whole document
collection) over all my indices?
Try using a ParallelMultiSearcher com
Chad Small wrote:
We have a requirement to return documents with a "title" field that starts with a certain letter. Is there a way to do something like this? We're using the StandardAnalyzer.
Example title fields:
This is the title of a document.
And this is a title of a different document.
peters marcus wrote:
is there a way to get all words stored in the index for a given document
Yes, in the 1.4 release:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getTermFreqVectors(int)
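A sketch of how that method might be used, against the Lucene 1.4 API (assumes the field was indexed with term vectors stored; not runnable without Lucene on the classpath):

```java
IndexReader reader = IndexReader.open("index");
// May return null if no term vectors were stored for this document.
TermFreqVector[] vectors = reader.getTermFreqVectors(docNumber);
if (vectors != null) {
    for (int i = 0; i < vectors.length; i++) {
        String[] terms = vectors[i].getTerms();
        int[] freqs = vectors[i].getTermFrequencies();
        for (int j = 0; j < terms.length; j++) {
            System.out.println(terms[j] + ": " + freqs[j]);
        }
    }
}
reader.close();
```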
Doug
Joe Rayguy wrote:
So, assuming that sort as implemented in 1.4 doesn't
work for me, my original question still stands. Do I
have to worry about merges that occur as documents are
added, or do I only have to rebuild my array after
optimizations? Or, alternatively, how did everyone
sort before 1.4?
Terry,
Can you please try to develop a reproducible test case? Otherwise it's
impossible to verify and debug this.
For something like this it would suffice to provide:
1. The initial index, which satisfies the test queries;
2. The new index you add;
3. Your merge and test code, as a s
Kevin A. Burton wrote:
Doug Cutting wrote:
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1413989
According to these, if your documents average 16k, then a 10-hit
result page would require just 66ms to generate highlights using
SimpleAnalyzer.
The whole search takes only 3
Doug Cutting wrote:
According to these, if your documents average 16k, then a 10-hit result
page would require just 66ms to generate highlights using SimpleAnalyzer.
Oops. That should be 110ms.
Doug
[EMAIL PROTECTED] wrote:
As a note of warning: I did find StandardTokenizer to be the major culprit in my
tokenizing benchmarks (avg 75ms for 16k sized docs).
I have found I can live without StandardTokenizer in my apps.
FYI, the message with Mark's timings can be found at:
http://nagoya.apache.o
Kevin A. Burton wrote:
I'm playing with this package:
http://home.clara.net/markharwood/lucene/highlight.htm
Trying to do hit highlighting. This implementation uses another
Analyzer to find the positions for the result terms.
This seems very inefficient.
Does it just seem inefficient,
Esmond Pitt wrote:
Don't want to start a buffer size war, but these have always seemed too
small to me. I'd recommend upping both InputStream and OutputStream buffer
sizes to at least 4k, as this is the cluster size on most disks these days,
and also a common VM page size.
Okay.
Reading and writin
Lucene 1.4 has not been released. Until it is released, you need to
check out the sources from CVS and build them, including javadoc.
Doug
Stephane James Vaucher wrote:
Are the javadocs available on the site?
I'd like to see the javadocs for lucene-1.4 (specifically SpanQuery)
somewhere on the
Kevin A. Burton wrote:
One way to force larger read-aheads might be to pump up Lucene's input
buffer size. As an experiment, try increasing InputStream.BUFFER_SIZE
to 1024*1024 or larger. You'll want to do this just for the merge
process and not for searching and indexing. That should help yo
[EMAIL PROTECTED] wrote:
Thanks for the post. BoostingQuery looks to be cleaner, faster and more generally useful than my
implementation :-)
Great! Glad to hear it was useful.
BTW, I've had a thought about your suggestion for making the highlighter use some form of RAMindex of sentence fragments
Boris Goldowsky wrote:
I have a situation where I'm querying for something in several fields,
with a clause similar to this:
(title:(two words)^20 keywords:(two words)^10 body:(two words))
Some good documents are being scored too low if the query terms do not
occur in the "body" field. I naive
Kevin A. Burton wrote:
We're using lucene with one large target index which right now is 5G.
Every night we take sub-indexes which are about 500M and merging them
into this main index. This merge (done via
IndexWriter.addIndexes(Directory[])) is taking way too much time.
Looking at the stats f
Charlie Smith wrote:
I'll vote yes please release new version with "too many files open" fixed.
There is no "too many files open" bug, except perhaps in your
application. It is, however, an easy problem to encounter if you don't
close indexes or if you change Lucene's default parameters. It will
[EMAIL PROTECTED] wrote:
I have not been able to work out how to get custom coordination going to
demote results based on a specific term [ ... ]
Yeah, it's a little more complicated than perhaps it should be.
I've attached a class which does this. I think it's faster and more
effective than wh
Chad Small wrote:
Thanks, Erik. OK, this is my official lobby effort for the release of 1.4 to final status. Anyone else need/want a 1.4 release?
Does anyone have any information on 1.4 release plans?
I'd like to make an RC once I manage to fix bug #27799, which will
hopefully be soon.
Doug
Eric Jain wrote:
I will need to have a look at the code, but I assume that in principle
it should be possible to replace the strings with sequential integers
once the sorting is done?
I don't understand the question.
Doug
Eric Jain wrote:
That's reasonable. What I didn't quite understand yet: If I sort on a
string field, will Lucene need to keep all values in memory all the
time, or only during startup?
It will cache one instance of each unique value. So if you have a
million documents and string sort results on a
Eric Jain wrote:
Just to clarify things: Does the current solution require all fields
that can be used for sorting to be loaded and kept in memory? (I guess
you can answer this question faster than I can figure it out by myself
:-)
Field values are loaded into memory. But values are kept in an arr
Boris Goldowsky wrote:
How difficult would it be to implement something like Cover Density
ranking for Lucene? Has anyone tried it?
Cover density is described at http://citeseer.ist.psu.edu/558750.html ,
and is supposed to be particularly good for short queries of the type
that you get in many
Doug Cutting wrote:
On Thu, 2004-03-18 at 13:32, Doug Cutting wrote:
Have you tried assigning these very small boosts (0 < boost < 1) and
assigning other query clauses relatively large boosts (boost > 1)?
I don't think you understood my proposal. You should try boosting the
docu
Boris Goldowsky wrote:
On Thu, 2004-03-18 at 13:32, Doug Cutting wrote:
Have you tried assigning these very small boosts (0 < boost < 1) and
assigning other query clauses relatively large boosts (boost > 1)?
I was trying to formulate a query like, say
+(title: asparagus) (doctyp
Have you tried assigning these very small boosts (0 < boost < 1) and
assigning other query clauses relatively large boosts (boost > 1)?
Boris Goldowsky wrote:
Is there any way to build a query where the occurrence of a particular
Term (in a Keyword field) causes the rank of the document to be
dec
Sam Hough wrote:
Can anybody confirm that no guarantee is given that Fields retain
their order within a Document?
Version 1.3 seems to (although reversing the order
on occasion).
In 1.3 they're reversed as added, then reversed as read, so that hits
have fields in their added order. In 1.4 I've fi
hui wrote:
If the document id is going to be changed, is it possible to define an
interface so the user could provide other implementation to replace the
default one? For example, the document unique timestamp or other fields as
long as they are long could be used.
I don't think that would be a goo
Chris Kimm wrote:
Unfortunately, I'm not able to batch the updates. The application needs
to make some decisions based on what each document looks like before
and after the update, so I have to do it one at a time.
Are these decisions dependent on other documents? If not, you should be
able
It sounds like you're not batching your updates.
The most efficient approch to update 1000 documents would be to:
1. Open an IndexReader;
2. Delete all 1000 documents;
3. Close the reader;
4. Open an IndexWriter;
5. Add all 1000 updated documents;
6. Close the IndexWriter.
Is that wha
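Those six steps might look roughly like this against the Lucene 1.4 API (the "id" field, the directory path, and the analyzer are placeholders for your own setup; a sketch, not runnable without Lucene):

```java
// Steps 1-3: delete the old versions through an IndexReader.
IndexReader reader = IndexReader.open("index");
for (Iterator it = idsToUpdate.iterator(); it.hasNext();) {
    reader.delete(new Term("id", (String) it.next()));
}
reader.close();

// Steps 4-6: add the updated documents through an IndexWriter.
IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
for (Iterator it = updatedDocs.iterator(); it.hasNext();) {
    writer.addDocument((Document) it.next());
}
writer.close();
```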
Kevin A. Burton wrote:
A discussion I had a while back had someone note (Doug?) that the
decision to go with 32-bit ints for document IDs was made because, on
32-bit machines, 64-bit values aren't thread-safe.
Someone, not me, perhaps provided that rationalization, which isn't a bad
one. In fact, the situati
Kevin A. Burton wrote:
> 3. Have two directories on the searcher. The indexer would then sync to
> a tmp directory and then at run time swap them via a rename once the
> sync is over.
The downside here is that this will take up 2x disk space on the
searcher. The upside is that the box will only s
Erik Hatcher wrote:
Yes, I saw it. But is there a reason not to just expose HashSet given
that it is the data structure that is most efficient? I bought into
Kevin's arguments that it made sense to just expose HashSet.
Just the general principle that one shouldn't expose more of the
implementa
Erik Hatcher wrote:
Also... your HashSet constructor has to copy values from the
original HashSet into the new HashSet ... not very clean and this can
just be removed by forcing the caller to use a HashSet (which they
should).
I've caved in and gone HashSet all the way.
Did you not see my mess
Jeff Wong wrote:
I noticed that Lucene 1.3-final source builds a JAR file whose version
number is "1.4-rc1-dev". What does this mean? Will 1.4-final build as
"1.5-rc1-dev"?
Probably. If you modify the sources of a 1.3-final release, and build
them, you're not building 1.3-final, but a derivativ
David Spencer wrote:
Maybe I missed something but I always thought the stop list should be a
Set, not a Map (or Hashtable/Dictionary). After all, all you need to
know is existence and that's what a Set does.
Good point.
Doug
Erik Hatcher wrote:
Well, one issue you didn't consider is changing a public method
signature. I will make this change, but leave the Hashtable signature
method there. I suppose we could change the signature to use a Map
instead, but I believe there are some issues with doing something like
t
hui wrote:
Index time: the compound format is 89 seconds slower.
compound format: 1389507 total milliseconds
non-compound format: 1300534 total milliseconds
The index size is 85m with 4 fields only. The files are stored in the index.
The compound format has only 3 files and the other has 13 files.
T
Erik Hatcher wrote:
private static final DecimalFormat formatter =
new DecimalFormat("0"); // make this as wide as you need
For ints, ten digits is probably safest. Since Lucene uses prefix
compression on the term dictionary, you don't pay a penalty at search
time for long shared pre
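A minimal illustration of the ten-digit padding (the class name is hypothetical; note that DecimalFormat is not thread-safe, so real code may want a per-thread instance):

```java
import java.text.DecimalFormat;

// Zero-pad ints to a fixed width so their string forms sort numerically.
class PaddedIntFormat {
    private static final DecimalFormat formatter =
        new DecimalFormat("0000000000"); // ten digits covers any positive int

    static String pad(int value) {
        return formatter.format(value);
    }

    public static void main(String[] args) {
        System.out.println(pad(42));   // 0000000042
        System.out.println(pad(2000)); // 0000002000
    }
}
```

With the padding in place, lexicographic term order matches numeric order, which is what range queries and sorting over string terms require.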
hui wrote:
Not yet. For the compound file format, when the files get bigger, if I add
a few new files frequently, the bigger files have to be updated. Will that
affect search much and produce heavier disk I/O compared with the
traditional index format? It seems the OS cache makes quite a difference wh
Stephane James Vaucher wrote:
As I've stated in my earlier mail, I like this change. More importantly,
could this become a "standard" way of changing configurations at runtime?
For example, the default merge factor could also be set in this manner.
Sure, that's reasonable, so this would be someth
Michael Duval wrote:
> I've hacked the code for the time being by updating FSDirectory and
replaced all System.getProperty("java.io.tmpdir")
calls with a call to a new method "getLockDir()". This method checks
for a "lucene.lockdir" prop before the
"java.io.tmpdir" prop giving the end user a bi
Michael Steiger wrote:
I'm wondering why there are no samples for this job. I do not think
that I am the first one looking for this.
If you found this confusing, and would have been helped by some
examples, please take the time to donate some good examples. Lucene is
free, but requires donati
Morus Walter wrote:
Now I think this can be fixed in the query parser alone by simply allowing
'-' within words.
That is, change
<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
to
<#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) >
As a result, query parser will read '-' within w
Erik Hatcher wrote:
On Feb 27, 2004, at 6:17 PM, Doug Cutting wrote:
I think it's document.add(). Fields are pushed onto the front, rather
than added to the end.
Ah, ok DocumentFieldList/DocumentFieldEnumeration are the culprits.
This is certainly a bug.
Yes, a bug that's
I think it's document.add(). Fields are pushed onto the front, rather
than added to the end.
Doug
Roy Klein wrote:
I think it's got something to do with Document.invertDocument().
When I reverse the words in the phrase, the other document matches the
phrase query.
Roy
-Original M
Roy Klein wrote:
E.g.
doc1.add(Field.indexed("field", "the"));
doc1.add(Field.indexed("field", "quick"));
doc1.add(Field.indexed("field", "brown"));
doc1.add(Field.indexed("field", "fox"));
doc1.add(Field.indexed("field", "jumped"));
writer.addDocument(doc1);
Vs.
doc2.add(Field.indexed("
Matt Quail wrote:
Is there any way to iterate through a TermEnum backwards? Okay, I know
that there isn't a way to do this via the TermEnum class, but is it
"implementable" on top of the underlying Lucene datastore?
Not really. The best you can do is skip back to the previous "indexed"
term in Te
How could Lucene know that something is "duplicate but older"? Sounds
like an application-specific thing.
Doug
Kevin A. Burton wrote:
Is there any way to prevent lucene from returning duplicate (but
'older') results from returning within a search result?
Kevin
Anson Lau wrote:
I'm trying to see what are some common ways to scale lucene onto
multiple boxes. Is RMI based search and using a MultiSearcher the
general approach?
Yes, although you probably want to use ParallelMultiSearcher.
Doug
Michael,
What JVM and OS are you using?
Your attachment did not make it through. If you continue to have
problems please submit a bug report and attach test code there.
Thanks,
Doug
Michael A. Schoen wrote:
I am using 1.3-final. Specifically I'm using the jar files from
lucene-1.3-final.zip.
David Townsend wrote:
Does this mean that if an IndexSearcher has hold of a segment file, and the index is then optimized, any subsequent search will use a list of files that probably don't exist anymore?
The IndexSearcher (through an IndexReader) has the files open, so it is
still valid, and may be s
Alan Smith wrote:
1. What happens if i make a backup (copy) of an index while documents
are being added? Can it cause problems, and if so is there a way to
safely do this?
This is not in general safe. A copy may not be a usable index. The
segments file points to the current set of files. An I
Rasik Pandey wrote:
Does anyone know of an implementation of a MultiReader (IndexReader over multiple indices) in the same spirit as the MultiSearcher?
I just committed one! This was really already there, in SegmentsReader,
but it was not public and needed a few minor changes. Enjoy.
Doug
David Spencer wrote:
Code rewritten, automagically chooses lots of defaults, lets you override
the defs thru the static vars at the bottom or the non-static vars also
at the bottom.
Has anyone used this? Was it useful? Should we add it to the sandbox?
Doug
David Spencer wrote:
2 files attached, SubstringQuery (which you'll use) and
SubstringTermEnum ( used by the former to be
consistent w/ other Query code).
I find this kind of query useful to have and think that the query parser
should allow it in spite of the perception
of this being slow, howev
Esmond Pitt wrote:
I have a field Author: and I'm using the StandardAnalyzer. When documents
with this field are added to the index, the field name 'Author' is
case-folded by the analyzer to 'author', and this is how it appears in the
index.
An analyzer does not process field names when indexing.
Daniel B. Davis wrote:
Are there other strategies not considered?
Why not store sponsored documents in a separate index, separately
searched, whose results are placed above those from the non-sponsored
documents?
Doug
would still be lousy, but add performance might be OK if the add operations were done in memory before committing them to the database. There would be a second index column, something like an index number.
Herb...
-Original Message-
From: Doug Cutting [mailto:[EMAIL PROT
fast nor slow. I gather that each term's posting list was an individual BLOB in the database. The term string was used as the index column. I believe the group used stemming.
Herb...
-Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Friday, February 06, 2004
Dror Matalon wrote:
I suspect you're going to get lousy performance compared to using
regular files.
Perhaps, but in theory it shouldn't be a lot worse than, e.g., accessing
an index over NFS. The tables might get fragmented as the index
evolves, and database optimization might help performance.
Philippe Laflamme wrote:
I've worked on an implementation for Postgres. I used the Large Object API
provided by the Postgres JDBC driver. It works fine but I doubt it is very
scalable because the number of open connections during indexing can become
very high.
Lucene opens many different files when
Using the terminology in
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html
fieldNorm is defined as
getBoost(t.field in d) * lengthNorm(t.field in d)
These two values are multiplied into a single value at index time, and it
is unfortunately impossible to separa
Karl Koch wrote:
Do you know good papers about strategies for how
to select keywords effectively, beyond the scope of stopword lists and stemming?
Using term frequencies of the document is not really possible, since Lucene
does not provide access to a document vector, does it?
Lucene does let you acce