: Michael Barbarelli wrote:
: > Can I instantiate a standard analyzer with an argument containing my own
: > stop words? If so, how? Will they be appended to or override the built-in
I'm really surprised how often this question gets asked ... Michael (or
anyone else for that matter) do you have a
Michael Barbarelli wrote:
Can I instantiate a standard analyzer with an argument containing my own
stop words? If so, how? Will they be appended to or override the built-in
stop words?
You can do it with one of the alternate constructors, and they'll
override the built-in list.
---
String
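Since the Lucene 2.x `StandardAnalyzer(String[] stopWords)` constructor replaces the default list rather than appending to it, you have to merge the two lists yourself if you want both. A minimal plain-Java sketch of that merge (the default list here is abbreviated for illustration; the merged array is what you would hand to the constructor):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class CombineStops {
    // Merge the built-in stop words with custom ones; the result is what
    // you would pass to new StandardAnalyzer(String[]) in Lucene 2.x.
    static String[] combine(String[] defaults, String[] custom) {
        Set<String> merged = new LinkedHashSet<String>(Arrays.asList(defaults));
        merged.addAll(Arrays.asList(custom));
        return merged.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] defaults = {"a", "an", "the", "of"}; // abbreviated default list
        String[] custom = {"lucene", "the"};          // duplicate "the" collapses
        System.out.println(Arrays.toString(combine(defaults, custom)));
        // prints [a, an, the, of, lucene]
    }
}
```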
I know this is a relatively fundamental thing to arrange, but I'm having
trouble.
Can I instantiate a standard analyzer with an argument containing my own
stop words? If so, how? Will they be appended to or override the built-in
stop words?
Or, do I have to modify the analyzer class itself and
On 4/12/07, Koji Sekiguchi <[EMAIL PROTECTED]> wrote:
> Is the index completely removed between the 2.0 and 2.1 runs?
Sure. If you look at my program, you'll see that I'm using RAMDirectory.
OK, I think it's due to the change in merge policy.
Lucene 2.0 could under-merge (not enough) or over-merge (b
> Is the index completely removed between the 2.0 and 2.1 runs?
Sure. If you look at my program, you'll see that I'm using RAMDirectory.
regards,
Koji
As I understand it, there really is no "space indicator". I think of it
as replacing the stop word with a space, which is then discarded.
So, you're indexing 'you find answer', and both your searches are
looking for 'you find answer'; the stop words are just gone as though
they never were. So bo
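A plain-Java sketch of that behavior, ignoring the position increments a real analyzer would also record (`analyze` here is a hypothetical stand-in for running StandardAnalyzer's token stream):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordDemo {
    // Stand-in for an analyzer: lowercase, split on whitespace, drop stop words.
    static List<String> analyze(String text, Set<String> stops) {
        List<String> out = new ArrayList<String>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (!stops.contains(tok)) out.add(tok);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<String>(Arrays.asList("will", "the"));
        // The indexed text and the stop-word-free query reduce to the same
        // token sequence, which is why both searches behave identically.
        System.out.println(analyze("you will find the answer", stops));
        System.out.println(analyze("you find answer", stops));
        // both print [you, find, answer]
    }
}
```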
I found some discussions of this question from back in 2003, but that was
many updates ago.
I have built an index using the standard stop analyser which uses the
standard list of stop words. "will" and "the" are stop words.
As I understand analyzers and phrase queries, when I search for
you wi
On 4/12/07, Koji Sekiguchi <[EMAIL PROTECTED]> wrote:
Chris,
> i'm not understanding this part of the thread ... are you saying that if
> you have two identical setups, the only difference being that one uses 2.0
> and the other uses 2.1, then you see different idfs after
> adding/deleting/re-add
Chris,
i'm not understanding this part of the thread ... are you saying that if
you have two identical setups, the only difference being that one uses 2.0
and the other uses 2.1, then you see different idfs after
adding/deleting/re-adding many docs?
Exactly. Please try to run the program whic
: But if now the index goes through a massive update, where almost all the
: docs containing TC are deleted, and TC is not in any newly added doc,
: practically TC becomes rare too, and hence D2 should probably be scored
: higher than D1. But IDF(TC) might not (yet) reflect the massive docs
: dele
Chris Hostetter <[EMAIL PROTECTED]> wrote on 12/04/2007 15:22:20:
>
> : But not which terms have an odd IDF value because of those deleted
> : documents. How much does the IDF value contribute to the "score" in
> : search?
>
> all idf's are affected equally, because the "numDocs" value used is
>
On 4/12/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: But not which terms have an odd IDF value because of those deleted
: documents. How much does the IDF value contribute to the "score" in
: search?
all idf's are affected equally, because the "numDocs" value used is
always the same
The
: > This should be the same for Lucene 2.0 and 2.1.
:
: I understand. But I think we could more easily come across this issue
: with Lucene 2.1 than 2.0?
i'm not understanding this part of the thread ... are you saying that if
you have two identical setups, the only difference being that one uses 2.0
: But not which terms have an odd IDF value because of those deleted
: documents. How much does the IDF value contribute to the "score" in
: search?
all idf's are affected equally, because the "numDocs" value used is
always the same ... it really shouldn't affect the scores from a query,
it jus
: If I remember correctly (you'll have to test this) sorting on a field
: which doesn't exist for every doc does what you would want (docs with
: values are listed before docs without)
: The actual behavior is different than described above. I modified
: TestSort.java:
: The actual order of the
To cover all possible non-indexing overhead, better measure with something
like this:
static long indexContents(IndexWriter writer, List storyContentList)
    throws IOException {
  long res = 0;
  if (storyContentList != null && storyContentList.size() != 0) {
    try {
Going out on the end of a long and fragile limb here. Do you get
information from the database in any of the calls in your indexing
loop? That is, do any of...
itr.next();
content.getStoryText(),
content.getStoryIdentity()
content.getHeadline1()
go out to the DB to get info, and could
> I tried to index it. It took from 7-10 seconds to index about 90
documents.
That would be around 10 documents per second - way too slow. A Lucene
perf test adding 12,000 docs sized similarly to your sample doc (1,400
characters) on a not-so-strong machine shows a much faster pace - 146 docs
per sec
I'm copying this reply from a topic with the same title from the defunct
'lucene-user' list. My comments follow it.
: I thought of putting empty strings instead of null values but I think
: empty strings are put first in the list while sorting which is the
: reverse of what anyone would want.
in
Otis,
I timed just for indexing.
thanks,
Tony
From: Otis Gospodnetic <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Re: Index performance
Date: Thu, 12 Apr 2007 09:31:49 -0700 (PDT)
Hi Tony,
Your code looks fine to me. I'm not sure what you timed - the whole a
Erick,
Thanks for the information. The id is generated by the database and it is
unique. So I only need to index it and don't need to store it, right?
Tony
From: "Erick Erickson" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: Re: Index performance
Date: Thu, 12 Apr
On 12 Apr 2007 at 20:22, karl wettin wrote:
>
> 12 apr 2007 kl. 20.00 skrev Steffen Heinrich:
>
> > This search is only meant to be used in an ajax-driven web
> > application.
> > And the basic idea is to give the user incentive and turn him to
> > something new, something he didn't think of bef
karl wettin <[EMAIL PROTECTED]> wrote on 12/04/2007 00:25:47:
>
> 12 apr 2007 kl. 09.12 skrev sai hariharan:
>
> > Thanx for replying. In my scenario i'm not going to index any of my
> > docs.
> > So is there a way to find out term frequencies of the terms in a doc
> > without doing the indexing p
12 apr 2007 kl. 20.00 skrev Steffen Heinrich:
This search is only meant to be used in an ajax-driven web
application.
And the basic idea is to give the user incentive and turn him to
something new, something he didn't think of before.
I just generalized on the concept in a mail to Erick under t
On 12 Apr 2007 at 16:49, karl wettin wrote:
>
> 12 apr 2007 kl. 12.19 skrev Steffen Heinrich:
>
> >
> > The intended system however can not be trained by user input. The
> > suggestions have to come from a given corpus (e.g. an occasionally
> > updated product database).
> > Do you think adapting
> The difference between IndexReader.maxDoc() and numDocs() tells you
> how many documents have been marked for deletion but still take up
> space in the index.
But not which terms have an odd IDF value because of those deleted
documents. How much does the IDF value contribute to the "score" in
s
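The check Antony is quoting can be sketched against the Lucene 2.x API like this (the index path is a placeholder, and the Lucene jar is assumed):

```java
import org.apache.lucene.index.IndexReader;

public class DeletedDocCount {
    public static void main(String[] args) throws Exception {
        // maxDoc() counts every document slot, deleted or not; numDocs()
        // counts only live documents, so the difference is the number of
        // docs marked deleted but still occupying space until a merge.
        IndexReader reader = IndexReader.open("/path/to/index"); // placeholder
        int deleted = reader.maxDoc() - reader.numDocs();
        System.out.println("deleted but not yet merged away: " + deleted);
        reader.close();
    }
}
```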
On 12 Apr 2007 at 9:27, Erick Erickson wrote:
> See below
...
> Not quite. As I understand your problem, you want all the terms that
> match (or at least a subset) for a field. For this, WildcardTermEnum
> is really all you need. Think of it this way...
> (Wildcard)TermEnum gives you a list of
Don't do that! Why are you trying to open the index 700,000 times?
During indexing or searching? In either case, there's no reason
to. You should be able to open the index and keep it open as long
as you want.
I still don't understand why you can't index the records individually,
but I'll assume
Thanks for your suggestion. I used Luke to debug and found the issue.
I have one million records to index, each of which has "Title",
"Description" and "Identifier". If I take each document and try to index these
fields my program was very slow. So I took 100,000 records and get the value
of these
I have one million records to index, each of which has "Title",
"Description" and "Identifier". If I take each document and try to index these
fields my program was very slow. So I took 100,000 records and get the value
of these fields, add them to the addDocument() method. Then I use the Index
wri
The issue is solved. Luke was very helpful in debugging; in fact it helped to
identify a very basic mistake we were making.
Lokeya wrote:
>
> I solved the issue by using:
>
> 1.Same Analyser.
> 2.Making indexing by tokenizing terms.
>
> Now issue with the following code is, I am facing issues
Another question is if I can delete documents based on the storyIdentity field
(using IndexReader.deleteDocuments(term)). Since the storyIdentity field is not
indexed, is there any performance issue, or should I index it too (and
store it)?
As to your very last question, No, there'll be no performance
On 4/12/07, Bill Janssen <[EMAIL PROTECTED]> wrote:
> docfreqs (idfs) do not take into account deleted docs.
> This is more of an engineering tradeoff rather than a feature.
> If we could cheaply and easily update idfs when documents are deleted
> from an index, we would.
Wow. So is it fair to
Hi Tony,
Your code looks fine to me. I'm not sure what you timed - the whole app run,
just indexing, indexing + optimizing... If you timed indexing + optimizing,
leave optimization out of the timer. How long do you think this should take?
Try setting maxBufferedDocs to 90.
Otis
> docfreqs (idfs) do not take into account deleted docs.
> This is more of an engineering tradeoff rather than a feature.
> If we could cheaply and easily update idfs when documents are deleted
> from an index, we would.
Wow. So is it fair to say that the stored IDF is really the
cumulative IDF f
I am trying to make an ftp search engine for searching filenames (not the
content). I am thinking of using Apache Commons Net for accessing ftp servers
and want to implement the indexing/searching part using Lucene. Can
anyone tell me how to use Lucene in this context?
I think you are right. But for sanity, if you really want the field to be
untokenized, use F.I.UN_TOKENIZED.
Otis
- Original Message
From: Paul Taylor <[EMAIL PROTEC
All,
Sorry for long email. I have two questions on indexing. My data consists of
an id, short headline and story text. Story text has some html tags. Here is
an example.
In early 2005, it seemed that Shamita Shetty had finally arrived after a
high profile debut in Mohabbatein [2000]. With 3
You have to refresh your IndexSearcher periodically.
Tony
From: anson <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: How to update index dynamically
Date: Mon, 09 Apr 2007 18:25:57 +0900
I have built a blog project under Tomcat 5.5 with Lucene 2.0.
And I want to
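The refresh pattern Tony describes, sketched for Lucene 2.0 (the index path is a placeholder and the Lucene jar is assumed; at this version there is no reopen() method, so the pattern is close-and-reopen):

```java
import org.apache.lucene.search.IndexSearcher;

public class RefreshDemo {
    public static void main(String[] args) throws Exception {
        String indexDir = "/path/to/index";     // placeholder
        IndexSearcher searcher = new IndexSearcher(indexDir);
        // ... an IndexWriter elsewhere adds documents and closes ...
        // A searcher only sees the index as of when it was opened, so to
        // pick up newly added documents, close it and open a fresh one.
        searcher.close();
        searcher = new IndexSearcher(indexDir); // sees the updates
        searcher.close();
    }
}
```

In a web app the reopen would typically be driven by a timer or by a check that the index has changed, rather than on every request.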
12 apr 2007 kl. 12.19 skrev Steffen Heinrich:
The intended system however can not be trained by user input. The
suggestions have to come from a given corpus (e.g. an occasionally
updated product database).
Do you think adapting your package to set up the tries from a corpus
would be fairly easy
See below
On 4/12/07, Steffen Heinrich <[EMAIL PROTECTED]> wrote:
On 11 Apr 2007 at 18:05, Erick Erickson wrote:
> Rather than using a search, have you thought about using a TermEnum?
> It's much, much, much faster than a query. What it allows you to do
> is enumerate the terms in the index
Are KeywordAnalyzer and Field.Index.UN_TOKENIZED synonymous?
i.e. if I create an IndexWriter with a KeywordAnalyzer, does it make any
difference whether I index my fields within documents added to this
index with Field.Index.UN_TOKENIZED or Field.Index.TOKENIZED?
thanks paul
--
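The distinction being asked about can be seen in the Lucene 2.x field setup (a sketch assuming the Lucene jar; the field names and values are made up):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class KeywordFieldDemo {
    public static void main(String[] args) {
        Document doc = new Document();
        // UN_TOKENIZED indexes the value as one term and never consults the
        // analyzer, so the choice of analyzer is irrelevant for this field.
        doc.add(new Field("id", "ABC-123", Field.Store.YES,
                Field.Index.UN_TOKENIZED));
        // TOKENIZED runs the analyzer; with KeywordAnalyzer the whole value
        // comes out as one token, giving the same single term -- but only
        // because that particular analyzer behaves that way.
        doc.add(new Field("title", "ABC-123", Field.Store.YES,
                Field.Index.TOKENIZED));
        System.out.println(doc);
    }
}
```

So the results coincide, but UN_TOKENIZED states the intent directly instead of relying on the analyzer choice.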
On 12 Apr 2007 at 0:28, karl wettin wrote:
>
> 11 apr 2007 kl. 22.32 skrev Steffen Heinrich:
>
> > According to occasional references on this list some people have
> > already tried to implement such a search with lucene but did they
> > succeed?
> >
> > My first idea was to run every completed t
On 12 Apr 2007 at 7:13, Antony Bowesman wrote:
> Steffen Heinrich wrote:
> > Normally an IndexWriter uses only one default Analyzer for all its
> > tokenizing businesses. And while it is apparently possible to supply
> > a certain other instance when adding a specific document there seems
> >
On 11 Apr 2007 at 18:05, Erick Erickson wrote:
> Rather than using a search, have you thought about using a TermEnum?
> It's much, much, much faster than a query. What it allows you to do
> is enumerate the terms in the index on a per-field basis. Essentially, this
> is what happens when you do a P
12 apr 2007 kl. 09.12 skrev sai hariharan:
Thanx for replying. In my scenario i'm not going to index any of my
docs.
So is there a way to find out term frequencies of the terms in a doc
without doing the indexing part?
Using an analyzer (TokenStream) and a Map?
while ((t = ts.next()) != null)
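Spelled out, the loop Karl is hinting at looks like this; a simple regex split stands in for calling next() on a real Lucene TokenStream, which is an assumption of this sketch:

```java
import java.util.HashMap;
import java.util.Map;

public class TermFreq {
    // Count term frequencies without building an index: tokenize and tally.
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (t.length() == 0) continue; // skip empty splits
            Integer n = freqs.get(t);
            freqs.put(t, n == null ? 1 : n + 1);
        }
        return freqs;
    }

    public static void main(String[] args) {
        System.out.println(termFreqs("To be or not to be"));
    }
}
```

With a real analyzer you would increment the count for each token the stream emits, which gives the same tallies with proper tokenization.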
Hi,
Thanx for replying. In my scenario i'm not going to index any of my docs.
So is there a way to find out term frequencies of the terms in a doc
without doing the indexing part?
Thanx in advance,
Hari
On 4/12/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
Add Term Vectors to your Field durin