Hi,
Thanks for replying. In my scenario I'm not going to index any of my docs.
So is there a way to find out the term frequencies of the terms in a doc
without doing the indexing part?
Thanks in advance,
Hari
On 4/12/07, Grant Ingersoll [EMAIL PROTECTED] wrote:
Add Term Vectors to your Field during
12 apr 2007 kl. 09.12 skrev sai hariharan:
Thanks for replying. In my scenario I'm not going to index any of my
docs.
So is there a way to find out the term frequencies of the terms in a doc
without doing the indexing part?
Using an analyzer (TokenStream) and a Map<String, Integer>?
while ((t =
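One way to do this without an index, sketched in plain Java. The whitespace/punctuation split below is a crude stand-in for a real Lucene Analyzer/TokenStream, and `countTerms` is a hypothetical helper name, not a Lucene API:

```java
import java.util.HashMap;
import java.util.Map;

class TermFreq {
    // Count term frequencies in raw text without building an index.
    // split("\\W+") is a naive substitute for an Analyzer's TokenStream.
    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.length() == 0) continue;
            Integer n = freqs.get(token);
            freqs.put(token, n == null ? 1 : n + 1);
        }
        return freqs;
    }

    public static void main(String[] args) {
        System.out.println(countTerms("To be, or not to be"));
    }
}
```

For real tokenization you would feed the text through the same Analyzer you would otherwise use at index time and count the tokens it emits.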
On 11 Apr 2007 at 18:05, Erick Erickson wrote:
Rather than using a search, have you thought about using a TermEnum?
It's much, much, much faster than a query. What it allows you to do
is enumerate the terms in the index on a per-field basis. Essentially, this
is what happens when you do a
On 12 Apr 2007 at 7:13, Antony Bowesman wrote:
Steffen Heinrich wrote:
Normally an IndexWriter uses only one default Analyzer for all its
tokenizing business. And while it is apparently possible to supply
a different instance when adding a specific document, there seems
to be no
On 12 Apr 2007 at 0:28, karl wettin wrote:
11 apr 2007 kl. 22.32 skrev Steffen Heinrich:
According to occasional references on this list some people have
already tried to implement such a search with Lucene, but did they
succeed?
My first idea was to run every completed token of the
Are KeywordAnalyzer and Field.Index.UN_TOKENIZED synonymous?
I.e., if I create an IndexWriter with a KeywordAnalyzer, does it make any
difference whether I index my fields within documents added to this
index with Field.Index.UN_TOKENIZED or Field.Index.TOKENIZED?
Thanks, Paul
See below
On 4/12/07, Steffen Heinrich [EMAIL PROTECTED] wrote:
On 11 Apr 2007 at 18:05, Erick Erickson wrote:
Rather than using a search, have you thought about using a TermEnum?
It's much, much, much faster than a query. What it allows you to do
is enumerate the terms in the index on
12 apr 2007 kl. 12.19 skrev Steffen Heinrich:
The intended system however cannot be trained by user input. The
suggestions have to come from a given corpus (e.g. an occasionally
updated product database).
Do you think adapting your package to set up the tries from a corpus
would be fairly
You have to refresh your IndexSearcher periodically.
Tony
From: anson [EMAIL PROTECTED]
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: How to update index dynamically
Date: Mon, 09 Apr 2007 18:25:57 +0900
I have built a blog project under Tomcat 5.5 with Lucene 2.0.
And I want to
All,
Sorry for the long email. I have two questions on indexing. My data consists of
an id, short headline and story text. Story text has some html tags. Here is
an example.
In early 2005, it seemed that Shamita Shetty had finally arrived after a
high-profile debut in <i>Mohabbatein</i> [2000]. <br>
I think you are right. But for sanity, if you really want the field to be
untokenized, use F.I.UN_TOKENIZED.
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share
- Original Message
From: Paul Taylor [EMAIL
I am trying to make an FTP search engine for searching filenames (not the
content). I am thinking of using Apache Commons Net for accessing FTP servers
and want to implement the indexing/searching part using Lucene. Can
anyone tell me how to use Lucene in this context?
docfreqs (idfs) do not take into account deleted docs.
This is more of an engineering tradeoff rather than a feature.
If we could cheaply and easily update idfs when documents are deleted
from an index, we would.
Wow. So is it fair to say that the stored IDF is really the
cumulative IDF for
Hi Tony,
Your code looks fine to me. I'm not sure what you timed - the whole app run,
just indexing, indexing + optimizing... If you timed indexing + optimizing,
leave optimization out of the timer. How long do you think this should take?
Try setting maxBufferedDocs to 90.
Otis
On 4/12/07, Bill Janssen [EMAIL PROTECTED] wrote:
docfreqs (idfs) do not take into account deleted docs.
This is more of an engineering tradeoff rather than a feature.
If we could cheaply and easily update idfs when documents are deleted
from an index, we would.
Wow. So is it fair to say
Another question is whether I can delete documents based on the storyIdentity
field (using IndexReader.deleteDocuments(term)). Since the storyIdentity field
is not indexed, is there any performance issue, or should I index it too (and
store it)?
As to your very last question, No, there'll be no
The issue is solved. Luke was very helpful in debugging; in fact it helped us
identify a very basic mistake we were making.
Lokeya wrote:
I solved the issue by using:
1. The same Analyzer.
2. Indexing by tokenizing terms.
Now the issue with the following code is, I am facing issues which
I have one million records to index, each of which has a Title,
Description and Identifier. If I take each document and try to index these
fields, my program is very slow. So I took 100,000 records, got the values
of these fields, and added them to the addDocument() method. Then I use the
IndexWriter
Don't do that <G>. Why are you trying to open the index 700,000 times?
During indexing or searching? In either case, there's no reason
to. You should be able to open the index and keep it open as long
as you want.
I still don't understand why you can't index the records individually,
but I'll
On 12 Apr 2007 at 9:27, Erick Erickson wrote:
See below
...
Not quite. As I understand your problem, you want all the terms that
match (or at least a subset) for a field. For this, WildcardTermEnum
is really all you need. Think of it this way...
(Wildcard)TermEnum gives you a list of all
The difference between IndexReader.maxDoc() and numDocs() tells you
how many documents have been marked for deletion but still take up
space in the index.
But not which terms have an odd IDF value because of those deleted
documents. How much does the IDF value contribute to the score in
12 apr 2007 kl. 20.00 skrev Steffen Heinrich:
This search is only meant to be used in an ajax-driven web
application.
And the basic idea is to give the user incentive and turn him to
something new, something he didn't think of before.
I just generalized on the concept in a mail to Erick under
karl wettin [EMAIL PROTECTED] wrote on 12/04/2007 00:25:47:
12 apr 2007 kl. 09.12 skrev sai hariharan:
Thanks for replying. In my scenario I'm not going to index any of my
docs.
So is there a way to find out the term frequencies of the terms in a doc
without doing the indexing part?
Using
On 12 Apr 2007 at 20:22, karl wettin wrote:
12 apr 2007 kl. 20.00 skrev Steffen Heinrich:
This search is only meant to be used in an ajax-driven web
application.
And the basic idea is to give the user incentive and turn him to
something new, something he didn't think of before.
I
To cover all possible non-indexing overhead, better to measure with something
like this:
static long indexContents(IndexWriter writer, List storyContentList)
    throws IOException {
  long res = 0;
  if (storyContentList != null && storyContentList.size() != 0) {
    try {
: But not which terms have an odd IDF value because of those deleted
: documents. How much does the IDF value contribute to the score in
: search?
all idfs are affected equally, because the numDocs value used is
always the same ... it really shouldn't affect the scores from a query,
it just
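For concreteness, a plain-Java sketch of the idf computation being discussed, assuming the DefaultSimilarity-style formula from Lucene 2.x (in practice the second argument comes from the searcher's maxDoc(), which, like docFreq, does not shrink when documents are merely marked deleted):

```java
class IdfSketch {
    // DefaultSimilarity-style idf: log(numDocs / (docFreq + 1)) + 1.
    // Deleting docs changes neither docFreq nor maxDoc until segments
    // are merged, so every term's idf is "stale" by the same numDocs,
    // and relative rankings within one index are largely unaffected.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // A rare term still scores higher than a common one.
        System.out.println(idf(5, 1000) + " vs " + idf(500, 1000));
    }
}
```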
: This should be the same for Lucene 2.0 and 2.1.
:
: I understand. But I think we could well come across this issue
: with Lucene 2.1 rather than 2.0?
i'm not understanding this part of the thread ... are you saying that if
you have two identical setups, the only difference being that one uses 2.0
On 4/12/07, Chris Hostetter [EMAIL PROTECTED] wrote:
: But not which terms have an odd IDF value because of those deleted
: documents. How much does the IDF value contribute to the score in
: search?
all idfs are affected equally, because the numDocs value used is
always the same
There
Chris Hostetter [EMAIL PROTECTED] wrote on 12/04/2007 15:22:20:
: But not which terms have an odd IDF value because of those deleted
: documents. How much does the IDF value contribute to the score in
: search?
all idfs are affected equally, because the numDocs value used is
always the
: But if now the index goes through a massive update, where almost all the
: docs containing TC are deleted, and TC is not in any newly added doc,
: practically TC becomes rare too, and hence D2 should probably be scored
: higher than D1. But IDF(TC) might not (yet) reflect the massive docs
:
Chris,
i'm not understanding this part of the thread ... are you saying that if
you have two identical setups, the only difference being that one uses 2.0
and the other uses 2.1, then you see different idfs after
adding/deleting/re-adding many docs?
Exactly. Please try to run the program
On 4/12/07, Koji Sekiguchi [EMAIL PROTECTED] wrote:
Chris,
i'm not understanding this part of the thread ... are you saying that if
you have two identical setups, the only difference being that one uses 2.0
and the other uses 2.1, then you see different idfs after
adding/deleting/re-adding
I found some discussions of this question from back in 2003, but that was
many updates ago.
I have built an index using the standard stop analyzer, which uses the
standard list of stop words. "will" and "the" are stop words.
As I understand analyzers and phrase queries, when I search for
"you will"
Is the index completely removed between the 2.0 and 2.1 runs?
Sure. If you see my program, you'll find I'm using RAMDirectory.
regards,
Koji
On 4/12/07, Koji Sekiguchi [EMAIL PROTECTED] wrote:
Is the index completely removed between the 2.0 and 2.1 runs?
Sure. If you see my program, you'll find I'm using RAMDirectory.
OK, I think it's due to the change in merge policy.
Lucene 2.0 could under-merge (not enough) or over-merge
I know this is a relatively fundamental thing to arrange, but I'm having
trouble.
Can I instantiate a standard analyzer with an argument containing my own
stop words? If so, how? Will they be appended to or override the built-in
stop words?
Or, do I have to modify the analyzer class itself
Michael Barbarelli wrote:
Can I instantiate a standard analyzer with an argument containing my own
stop words? If so, how? Will they be appended to or override the built-in
stop words?
You can do it with one of the alternate constructors, and they'll
override the built-in list.
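To make the override-not-append semantics concrete, here is a plain-Java sketch. These are not the actual Lucene classes; `filter` is a hypothetical helper that mimics what passing a custom stop set to StandardAnalyzer's alternate constructor does:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopFilterSketch {
    // The custom stop set is used INSTEAD of the built-in English list,
    // not merged with it - which is why "the" passes through below.
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<String>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> custom = new HashSet<String>(Arrays.asList("foo"));
        System.out.println(filter(Arrays.asList("the", "foo", "query"), custom));
    }
}
```

If you want both your words and the defaults stopped, build a set that contains the union of the two lists before handing it to the analyzer.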
---
: Michael Barbarelli wrote:
: Can I instantiate a standard analyzer with an argument containing my own
: stop words? If so, how? Will they be appended to or override the built-in
I'm really surprised how often this question gets asked ... Michael (or
anyone else for that matter) do you have