Hi Erik,
I am a newcomer to this list, so please allow me to ask a dumb
question.
For the StandardAnalyzer, will it have to be modified to accept
different character encodings?
We have customers in China, Taiwan and Hong Kong. Chinese data may come
in three different encodings: Big5, GB, and UTF-8.
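StandardAnalyzer itself operates on Java's Unicode chars, so rather than
modifying it, the usual approach is to decode each source encoding at the
I/O layer before analysis. A minimal sketch (file and field names are
illustrative; note that StandardAnalyzer emits each CJK character as its
own token, so a CJK-aware analyzer may still be worth considering):

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class EncodingDemo {
        public static void main(String[] args) throws Exception {
            // Decode the raw bytes into Unicode here; the analyzer never
            // sees the on-disk encoding. Pick "Big5", "GB2312", or "UTF-8"
            // depending on the source of the document.
            Reader reader =
                new InputStreamReader(new FileInputStream("doc.txt"), "Big5");
            TokenStream tokens =
                new StandardAnalyzer().tokenStream("contents", reader);
            for (Token t = tokens.next(); t != null; t = tokens.next()) {
                System.out.println(t.termText());
            }
        }
    }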
This is a pretty interesting problem. I envy you.
I would avoid the existing highlighter for your purposes -- highlighting
in token space is a very different problem from "highlighting" in 2D
space.
Based on the XML sample you provided, it looks like your XML files
are already a "tokenized" form
I'll make sure no indexing is started before the optimization is done.
Most likely Sunday will be the optimization day for the indexes, and every
other night documents will be added to the index.
Only searching will be available through the web service while optimizing,
but this should not be a problem.
You should be careful, however, not to end up with two VM instances each
trying to open an index writer at the same time - one of them is going
to fail.
That is, if someone using your web interface tries to add a new document to
the index while you have the optimizer running standalone, the web i
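One common way to cope, sketched below under the assumption that the
standalone optimizer may briefly hold the write lock: catch the failed open
and retry after a pause (with the 1.4-era API, the second writer fails with
an IOException along the lines of "Lock obtain timed out"):

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class WriterGuard {
        // Try to open a writer on an existing index, backing off while
        // another process (e.g. the standalone optimizer) holds the lock.
        public static IndexWriter openWithRetry(String indexDir, int attempts)
                throws IOException, InterruptedException {
            for (int i = 0; i < attempts; i++) {
                try {
                    // create == false: append to the existing index
                    return new IndexWriter(indexDir, new StandardAnalyzer(), false);
                } catch (IOException e) {
                    Thread.sleep(5000); // lock presumably held; wait and retry
                }
            }
            throw new IOException("gave up waiting for the index write lock");
        }
    }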
Thanks all for the useful comments.
It seems that there are even more options --
4/ One index, with a separate Lucene document for each (item, language)
combination, with one field that specifies the language (sketched below)
5/ One index, one Lucene document per item, with field names that include the
language
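For what it's worth, a minimal sketch of option 4 using the 1.4-era Field
factories (the field names are illustrative, not anything the thread
prescribes):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class MultiLangDoc {
        // Option 4: one Document per (item, language) pair, with the
        // language kept in its own untokenized field.
        static Document build(String itemId, String lang, String body) {
            Document doc = new Document();
            doc.add(Field.Keyword("itemId", itemId)); // untokenized join key
            doc.add(Field.Keyword("lang", lang));     // e.g. "en", "de"
            doc.add(Field.Text("contents", body));    // tokenized, searchable
            return doc;
        }
    }

Restricting a search to one language is then just an extra required
TermQuery on "lang", whereas option 5 would instead search a per-language
field such as "contents_de".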
Thanks for the suggestion; Jian Chen's idea is very similar.
Optimizing that often is probably not necessary and not that critical for
speeding up the searches.
I'll try changing the index process not to optimize at all and execute the
optimization independently of the indexing, on a weekly basis.
OK. So if I get 10 Documents back from a search and I want to get the top 5
weighted terms for each of the 10 documents, what API call should I use? I'm
unable to find the connection between Similarity and a Document.
I know I'm missing the elephant that must be in the middle of the room. Or
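If the field was indexed with term vectors enabled, something along these
lines should work with the 1.4-era API. It ranks by raw frequency; docId is
the document number (e.g. Hits.id(n) for the n-th hit), and "contents" is an
assumed field name:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    public class TopTerms {
        // Print the most frequent terms of one document's term vector.
        static void printTopTerms(IndexReader reader, int docId, int howMany)
                throws IOException {
            TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
            if (tfv == null) return; // no vector stored for this field
            String[] terms = tfv.getTerms();
            // Clone so knocking out winners can't disturb the vector itself.
            int[] freqs = (int[]) tfv.getTermFrequencies().clone();
            for (int n = 0; n < howMany && n < freqs.length; n++) {
                int best = 0;
                for (int i = 1; i < freqs.length; i++) {
                    if (freqs[i] > freqs[best]) best = i;
                }
                System.out.println(terms[best] + " (" + freqs[best] + ")");
                freqs[best] = -1; // knock out the winner for the next pass
            }
        }
    }

Raw frequency favors common words; multiplying each frequency by an idf
factor (see the wi = tfi * IDFi message further down) usually gives
better "top terms".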
Hi,
I am involved in a project that is trying to provide searching and hit
highlighting on scanned images of historical newspapers. We have an XML-based
OCR format. A sample is below. We need to index the CONTENT attribute
of the String element, which is the easy part. We would like to
Hi,
optimize() merges the index segments into one single index segment. In
your case, I guess the 2 GB index segment is quite large; if you merge
it with any other small index segments, the merging process will
definitely be slow.
I think the performance should be fine without calling optimize().
Mor
I would run your optimize process in a separate thread, so that your web
client doesn't have to wait for it to return.
You may even want to set the optimize part up to run on a weekly
schedule, at a low-load time. I probably wouldn't re-optimize after
every 30 documents on an index that size.
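A sketch of that schedule with a plain java.util.Timer (path and period are
placeholders; any job scheduler would do just as well):

    import java.util.Timer;
    import java.util.TimerTask;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class WeeklyOptimizer {
        public static void main(String[] args) {
            final String indexDir = "/path/to/index"; // placeholder
            TimerTask optimize = new TimerTask() {
                public void run() {
                    try {
                        // Open against the existing index, merge all
                        // segments, then release the write lock promptly.
                        IndexWriter writer = new IndexWriter(
                                indexDir, new StandardAnalyzer(), false);
                        writer.optimize();
                        writer.close();
                    } catch (Exception e) {
                        e.printStackTrace(); // log it; try again next run
                    }
                }
            };
            // Run once now, then every 7 days, ideally timed to land in
            // the low-load window (e.g. Sunday night).
            new Timer().schedule(optimize, 0L, 7L * 24 * 60 * 60 * 1000);
        }
    }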
I would like to bring that issue up again as I haven't resolved it yet and
haven't found what's causing it.
Any help, ideas, or shared experiences are welcome!
Thanks,
Ross
-Original Message-
From: Angelov, Rossen
Sent: Friday, May 27, 2005 10:42 AM
To: 'java-user@lucene.apache.org'
Subj
Hi Erik,
Thank you very much for your reply.
The problem is I need only the caching without any date or query
functionality.
Since the only two Filters constructable without wrapping are the
QueryFilter and DateFilter, I need either to have a dummy filter, or to have
a constructor for the Caching
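If the dummy-filter route is taken, a minimal match-all Filter against the
1.4-era bits() API might look like this (whether it can then be handed to a
caching wrapper depends on the Lucene version in use):

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;

    // A trivial Filter that accepts every document, for cases where the
    // caching behavior is wanted without any date/query restriction.
    public class MatchAllFilter extends Filter {
        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());
            bits.set(0, reader.maxDoc()); // every doc passes the filter
            return bits;
        }
    }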
I don't think you need to add to a term vector.
I guess what I was thinking (and this is just a guess not knowing your
architecture) is that you could have a TokenStreamFilter/Analyzer that
took in the appropriate term vectors, along with your runtime
incremental information. Then, as you are "in
Hi Santanu,
What type of API does Documentum provide? Does it expose the meta-data for
all
of its stored documents? Does it provide access to the actual documents?
Thanks,
Andrew
-Original Message-
From: Santanu Dutta <[EMAIL PROTECTED]>
Sent: Jun 2, 2005 4:53 AM
To: java-user@luc
Hi,
DefaultSimilarity uses essentially this weighting scheme. Makes sense, since
it's a pretty standard relevance measure...
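For reference, the two factors as they appear in the 1.4-era
DefaultSimilarity sources (paraphrased): a dampened tf and a smoothed idf,
multiplied together in the score, i.e. the classic tf*idf scheme:

    // tf: square root dampens the raw within-document frequency
    public float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // idf: log(D / (df + 1)) + 1, smoothed so df == D still scores > 0
    public float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }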
Bye!
max
-Original Message-
From: Andrew Boyd [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 02, 2005 11:39
To: java-user@lucene.apache.org
Subject: calculate wi = tfi
Hi Andrew
I have experience using Lucene for a Content Management System (content
repository). We are using different file types and different
locales (fe, de, us, ru)
Santanu
-Original Message-
From: Andrew Boyd [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 02, 2005 3:19 PM
To: java-user@luce
Hi All,
Has anyone had any experience using Lucene to search a Documentum repository?
Thanks
Andrew
If I have search results, how can I calculate, using Lucene's API, wi = tfi *
IDFi for each document?
wi = term weight
tfi = frequency of term i in the document
IDFi = inverse document frequency = log(D/dfi)
dfi = document frequency, i.e. the number of documents containing term i
D = number of documents
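A sketch of computing that weight straight from index statistics with the
1.4-era API; it assumes the field was indexed with term vectors, "contents"
is an illustrative field name, and dfi and D are read off the IndexReader:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermFreqVector;

    public class TermWeight {
        // wi = tfi * log(D / dfi) for one term in one document.
        static double weight(IndexReader reader, int docId, String word)
                throws IOException {
            int dfi = reader.docFreq(new Term("contents", word)); // docs with term i
            int D = reader.numDocs();                             // total docs
            if (dfi == 0) return 0.0;
            TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
            if (tfv == null) return 0.0; // no vector stored for this field
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                if (terms[i].equals(word)) {
                    return freqs[i] * Math.log((double) D / dfi); // tfi * IDFi
                }
            }
            return 0.0; // term does not occur in this document
        }
    }

Note that Lucene's own DefaultSimilarity dampens tf with a square root and
smooths the idf denominator, so its scores won't match this raw formula
exactly.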