bad link in mailing list archive?

2003-11-25 Thread Gerret Apelt
When I load

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=2648

there are three replies listed at the bottom of the page, one by Otis 
Gospodnetic. The subject of his reply is "Concurency in Lucene".

When I click on Otis' reply, my browser loads a post with a different 
subject; "Problems compiling the java source codes of lucene search engine".

Just a heads up -- seems like something funky's going on :)

cheers,
Gerret
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Score

2003-11-25 Thread Gerret Apelt
Pleasant, Tracy wrote:

I tried using Boost but that did absolutely nothing.

The documents I am using: 
Plain text
PDF Documents
(I have two indexes) 
 

I'm not sure what's causing your scores to be off -- unless, of course, 
your scores just look wrong to you but they're in fact just what you 
should be getting :)
One bug in my code was that for an unrelated reason, terms in one field 
would never be matched. But since other fields contained the same term, 
the document was still being reported as a hit -- with a 
lower-than-expected score. Maybe you want to double check that the 
content of each field is getting tokenized properly. When you have a 
term t in the title field that is unique to a particular document (i.e. 
not contained in any of the other fields of that document), do you still 
get a hit on the document when searching for t? Boost factors don't help, 
of course, if there's no hit in the first place.

When you say you use different analyzers for different fields in your
index, how would you accomplish that? When I create the index it has a
parameter for analyzer.. unless you create different indexes , how do
you use two different ones? 
 

Use PerFieldAnalyzerWrapper:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
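
To make the idea concrete, here is a minimal sketch of what a per-field wrapper does: it keeps a default analyzer plus a map of field-specific ones, and picks by field name. This is a toy model in plain Java, not Lucene's API. The class PerFieldAnalysisSketch, its methods, and the field names "title" and "id" are made up for illustration; check the javadoc above for the real class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Toy model of per-field analysis. NOT Lucene's API; all names are hypothetical.
final class PerFieldAnalysisSketch {

    // Default tokenizer: lowercase the text, then split on whitespace.
    static List<String> lowercaseWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    // Field-specific tokenizer: keep the whole value as one unchanged token.
    static List<String> wholeValue(String text) {
        return Collections.singletonList(text);
    }

    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldAnalysisSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    // Register a tokenizer that overrides the default for one field.
    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        perField.put(field, analyzer);
    }

    // Tokenize with the field's own tokenizer if one is registered, else the default.
    List<String> tokenize(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    public static void main(String[] args) {
        PerFieldAnalysisSketch wrapper =
                new PerFieldAnalysisSketch(PerFieldAnalysisSketch::lowercaseWords);
        wrapper.addAnalyzer("id", PerFieldAnalysisSketch::wholeValue);

        System.out.println(wrapper.tokenize("title", "Big Dog"));
        System.out.println(wrapper.tokenize("id", "DOC 42"));
    }
}
```

The point of the design is that the same wrapper object can be handed to both the indexing side and the query-parsing side, so each field is tokenized the same way in both places.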
cheers
Gerret


-----Original Message-----
From: Gerret Apelt [mailto:[EMAIL PROTECTED]
Sent: Monday, November 24, 2003 3:25 PM
To: Lucene Users List
Subject: Re: Score
Tracy --

it would help if you could give more detail on the types of documents, 
fields and analyzers you're using. Also what do you mean by "Multi Field 
Search"? I presume you're using the MultiFieldQueryParser to have query 
terms in a user-submitted query be searched for in each field in your index.

If I am understanding your problem, then it might be the same one I had 
a few weeks ago -- highly relevant matches would not receive a high 
ranking. (This paragraph will apply to you only if you use more than 
just one Analyzer for the set of your fields). I had six fields in my 
index, most of which were populated with a standard analyzer. I used 
self-made Analyzers for two of the fields. This turned out to be my 
problem when using MultiFieldQueryParser: I told my 
MultiFieldQueryParser instance to use only the standard analyzer. 
Instead I discovered that I needed to make use of 
org.apache.lucene.analysis.PerFieldAnalyzerWrapper and feed that to the 
MultiFieldQueryParser. Unless you do this, your problem is what's 
described here: 
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q15.

Most likely, if your scoring is off, you're "doing something wrong" in 
the way you use the Lucene API -- at least, that's what I've discovered 
to be the case when my ranking is off.

If you're interested in the nitty-gritty of how scoring is done, check 
this FAQ entry:
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q31

cheers,
Gerret
Pleasant, Tracy wrote:

 

Hi,

I'm using the Multi Field Search to search all the fields of my
documents during the search. 

When it returns results the scores are numerically low - .06, .17, etc.
I would think if I searched for "Dog" and there was a doc with "Dog" in
the title and several times in the contents of a document that it would
receive a score more like 1.0 or close to it.
Is there a way that I can tweak the score?

I tried using Boost but that did absolutely nothing.



Re: Score

2003-11-24 Thread Gerret Apelt
Tracy --

it would help if you could give more detail on the types of documents, 
fields and analyzers you're using. Also what do you mean by "Multi Field 
Search"? I presume you're using the MultiFieldQueryParser to have query 
terms in a user-submitted query be searched for in each field in your index.

If I am understanding your problem, then it might be the same one I had 
a few weeks ago -- highly relevant matches would not receive a high 
ranking. (This paragraph will apply to you only if you use more than 
just one Analyzer for the set of your fields). I had six fields in my 
index, most of which were populated with a standard analyzer. I used 
self-made Analyzers for two of the fields. This turned out to be my 
problem when using MultiFieldQueryParser: I told my 
MultiFieldQueryParser instance to use only the standard analyzer. 
Instead I discovered that I needed to make use of 
org.apache.lucene.analysis.PerFieldAnalyzerWrapper and feed that to the 
MultiFieldQueryParser. Unless you do this, your problem is what's 
described here: 
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q15.
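
A small sketch of why the mismatch hurts, assuming one field was indexed with a custom analyzer that keeps the whole value as a single term while the query text goes through a standard-like analyzer. This is plain Java and not Lucene code; AnalyzerMismatchSketch, standardLike, keywordLike and the sample value "AB-123" are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Toy illustration of an index-time / query-time analyzer mismatch.
// NOT Lucene's API; all names are hypothetical.
final class AnalyzerMismatchSketch {

    // "Standard-like": lowercase, then split on anything that is not a letter or digit.
    static List<String> standardLike(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // "Custom": the whole value becomes one unchanged term.
    static List<String> keywordLike(String text) {
        return Collections.singletonList(text);
    }

    // True if every term produced from the query is among the terms indexed for the field.
    static boolean matches(String indexedValue,
                           Function<String, List<String>> indexAnalyzer,
                           String query,
                           Function<String, List<String>> queryAnalyzer) {
        Set<String> indexedTerms = new HashSet<>(indexAnalyzer.apply(indexedValue));
        return indexedTerms.containsAll(queryAnalyzer.apply(query));
    }

    public static void main(String[] args) {
        // Indexed as the single term "AB-123", queried as "ab" and "123": no hit in this field.
        System.out.println(matches("AB-123", AnalyzerMismatchSketch::keywordLike,
                                   "AB-123", AnalyzerMismatchSketch::standardLike));
        // Same analyzer on both sides: the field matches.
        System.out.println(matches("AB-123", AnalyzerMismatchSketch::keywordLike,
                                   "AB-123", AnalyzerMismatchSketch::keywordLike));
    }
}
```

When a field silently fails to match like this, the document can still be returned through its other fields, only with a lower score than expected.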

Most likely, if your scoring is off, you're "doing something wrong" in 
the way you use the Lucene API -- at least, that's what I've discovered 
to be the case when my ranking is off.

If you're interested in the nitty-gritty of how scoring is done, check 
this FAQ entry:
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q31

cheers,
Gerret
Pleasant, Tracy wrote:

Hi,

I'm using the Multi Field Search to search all the fields of my
documents during the search. 

When it returns results the scores are numerically low - .06, .17, etc.
I would think if I searched for "Dog" and there was a doc with "Dog" in
the title and several times in the contents of a document that it would
receive a score more like 1.0 or close to it.
Is there a way that I can tweak the score?

I tried using Boost but that did absolutely nothing.



understanding IR topics on this list [was: Re: Vector Space Model in Lucene?]

2003-11-15 Thread Gerret Apelt
Dror --

I just completed an introductory course in IR. I can recommend the 
textbook we used: "Managing Gigabytes: Compressing and Indexing 
Documents and Images". When I don't understand posts on this list I can 
typically look up the theory in that book, then come back to the list 
and have a better idea of what's going on. "Managing Gigabytes" appears 
to be getting good reviews from most readers, but I can't compare it to 
similar works as I haven't read any.

I've spent some time searching for websites that introduce advanced IR 
topics at a level that is less rigorous than academic papers. But I 
haven't really found anything I can recommend. Suggestions welcome :)

cheers,
Gerret
**
Dror Matalon wrote:
Hi,

I might be the only person on the list who's having a hard time
following this discussion. Would one of you wise folks care to point me
to a good "dummies", also known as an executive summary, resource about
the theoretical background of all of this. I understand the basic
premise of collecting the "words" and having pointers to documents and
weights, but beyond that ...
TIA,

Dror

On Fri, Nov 14, 2003 at 12:52:15PM -0500, Chong, Herb wrote:
 

i don't know of any open source search engine that incorporates interterm correlation. i have been looking into how to do this in Lucene and so far, it's not been promising. the indexing engine and file format needs to be changed. there are very few search engines that incorporate interterm correlation in any mathematically and linguistically rigorous manner. i designed a couple, but they were all research experiments.

if you are familiar with the TREC automatic adhoc track? my experiments with the TREC-5 to TREC-7 questions produced about 0.05 to 0.10 improvement in average precision by proper use of interterm correlation. my project at the time was cancelled after TREC-7 and so there haven't been any new developments.

Herb

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2003 12:39 PM
To: Lucene Users List
Subject: Re: Vector Space Model in Lucene?
Herb

Hmm... Are you perhaps familiar with some open system which doesn't? I'm 
curious because one of my projects (already using Lucene) could benefit 
from such feature. Right now I'm using a bastardized version of Markov 
chains, but it's more of a hack...

--
Best regards,
Andrzej Bialecki


Re: fuzzy searches

2003-11-11 Thread Gerret Apelt
Thomas Krämer wrote:

Is there an overview of the structure of the index of lucene despite 
of the javadoc or any other fast access to understanding what happens 
inside lucene?

You mean something like this?:

http://jakarta.apache.org/lucene/docs/fileformats.html

cheers,
Gerret


Re: term counts during indexing

2003-11-06 Thread Gerret Apelt
Peter --

sorry for the delay; I just accidentally saw your reply in the mailing 
list archive -- must have overlooked it in my inbox :(

Peter Keegan wrote:

As I understand it, the field text is being tokenized by the analyzer when
IndexWriter.addDocument is called. At this point, the tokens are indexed
and/or stored. Would it be possible for 'addDocument' to save and make the
_actual_ counts of 'tokens stored' and 'tokens indexed' available in either
the Document or IndexWriter object? I guess I may be turning this into a
feature request :)
 

Lucene uses an inverted index, so the index is based on a mapping from 
"term" instances to the documents that contain them, as opposed to 
"document" instances mapping to a list of terms contained in that 
document (which is a fancy way of saying, "Lucene doesn't store 
documents; filesystems do that").
So in terms of the index representation, Lucene could not simply add a 
"term count" parameter to the entry for a given document, because 
(unless we're talking about a stored field) there is no table in which 
such an entry could exist. You would need to add a totally new data 
structure to the index, which can store document properties for 
un-stored fields. This sort of defeats the purpose of un-stored 
fields. It sounds wrong to have an un-stored field and store its termcount.
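
A deliberately simplified sketch of the term-to-documents mapping, in plain Java. It is not Lucene's index format (the real format records more per posting, see the file formats page), and InvertedIndexSketch with its methods is invented for illustration. What it shows is that the lookup key is the term, so there is no per-document record where a token count would naturally live.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: term -> ids of the documents containing it.
// NOT Lucene's index format; all names are hypothetical.
final class InvertedIndexSketch {

    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Tokenize on whitespace and record the document id under each term.
    // The document text itself is not kept anywhere.
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Look up the documents for a term; empty if the term was never indexed.
    SortedSet<Integer> docsContaining(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                                     Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(1, "the quick dog");
        index.addDocument(2, "the lazy dog dog");

        System.out.println("dog   -> " + index.docsContaining("dog"));
        System.out.println("quick -> " + index.docsContaining("quick"));
    }
}
```

In this toy, document 2 went in as four tokens, but nothing in the structure remembers that number; answering "how many tokens did document 2 have?" would need an extra per-document table.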

Here's a proposal for a hack you could do: write an Analyzer wrapper 
that counts tokens emitted by the Analyzer's TokenStream's next() 
method, which is called by IndexWriter.addDocument(Document). When 
TokenStream.next() returns null, you can store the tokenCount that you 
have maintained in a file or database. This is fairly ugly but it has 
the advantage that it will work for non-stored fields.
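
A rough sketch of the counting idea, in plain Java rather than against the Lucene classes. TokenSource stands in for a token stream whose next() returns null when exhausted, and CountingSource, CountListener and fromWords are invented names. A real version would wrap the stream returned by the Analyzer and would need to be adapted to the actual TokenStream signature.

```java
import java.util.Arrays;
import java.util.Iterator;

// Toy counting wrapper around a token stream. NOT Lucene's API; all names are hypothetical.
final class CountingTokenStreamSketch {

    // Stand-in for a token stream: next() returns null once the stream is exhausted.
    interface TokenSource {
        String next();
    }

    // Called once with the final count, e.g. to write it to a file or database.
    interface CountListener {
        void onEnd(int tokenCount);
    }

    // Passes tokens through unchanged while counting them.
    static final class CountingSource implements TokenSource {
        private final TokenSource delegate;
        private final CountListener listener;
        private int count;
        private boolean reported;

        CountingSource(TokenSource delegate, CountListener listener) {
            this.delegate = delegate;
            this.listener = listener;
        }

        @Override
        public String next() {
            String token = delegate.next();
            if (token != null) {
                count++;
            } else if (!reported) {
                // End of stream: report the count exactly once.
                reported = true;
                listener.onEnd(count);
            }
            return token;
        }
    }

    // Helper that turns a fixed list of words into a TokenSource.
    static TokenSource fromWords(String... words) {
        Iterator<String> it = Arrays.asList(words).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        int[] recorded = new int[1];
        TokenSource counted = new CountingSource(
                fromWords("lucene", "counts", "tokens"), n -> recorded[0] = n);

        // The consumer drains the stream, as the indexer would.
        while (counted.next() != null) {
            // tokens would be indexed here
        }
        System.out.println("tokens seen: " + recorded[0]);
    }
}
```

Because the wrapper only observes tokens as they pass through, it does not matter whether the field is stored or not.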

I doubt there will be much support for extending Lucene to store field 
properties for unstored fields. Maybe there could be another field type 
called TERMCOUNTED_FIELD? Maybe some of the core coders could comment.

Also, I can't find this method from the code snippet provided by Gerret (I'm
using v1.2):
 

String[] fieldTerms = doc.getValues(fieldName);
   

hmm, it must have been added later then:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Document.html
cheers,
Gerret


Thanks,
Peter
----- Original Message -----
From: "Gerret Apelt" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, October 29, 2003 9:44 PM
Subject: Re: term counts during indexing

 

Peter Keegan wrote:

   

Is there a simple and efficient way of determining the number of tokens added
to a document after adding each field ('Document.add'), as a result of the actions
of the Analyzer, without having to re-parse the field?
 

Peter --

you can ask the Document instance.

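import java.util.Enumeration;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = getDocumentInstanceFromSomewhere();
int termCount = 0;
// walk every field of the document and add up what getValues() returns
Enumeration fields = doc.fields();
while (fields.hasMoreElements()) {
   Field field = (Field)fields.nextElement();
   String fieldName = field.name();
   String[] fieldTerms = doc.getValues(fieldName);
   termCount += fieldTerms.length;
}
System.out.println("The fields of the document together contain "
   + termCount + " terms.");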
Note that
1) I haven't tried to compile this code, so I'm not sure if it works
2) this will only work for those fields where field.isStored() == true.
If the field isn't stored in the index, then you don't have a choice but
to go back to the document.
[not sure on the following, so please correct me if in error:] Remember
that unStored fields are indexed, so you can query on them, but the
field terms themselves are not stored in the index. Therefore you cannot
count them by asking Lucene. A Lucene field instance also has no way to
reference the source of the terms that are added to it. The field
doesn't care where its terms came from. So if field.isStored() == false,
then for that particular field Lucene cannot tell you how many terms are
in it. You'll have to write your own code that analyzes the original
data source in this case.
   

Alternatively, is there a way to determine the number of tokens added after
adding the document to the index ('IndexWriter.addDocument')?

 

Whether you want the termCount for a document before or after you add
the document to the index doesn't matter, so the answer is "see above".
cheers,
Gerret


Re: term counts during indexing

2003-10-29 Thread Gerret Apelt
Clarification: in the text quoted below I meant to say "... choice but 
to go back to the _original data source_".

cheers,
Gerret
Gerret Apelt wrote:

Note that
1) I haven't tried to compile this code, so I'm not sure if it works
2) this will only work for those fields where field.isStored() == 
true. If the field isn't stored in the index, then you don't have a 
choice but to go back to the document.






Re: term counts during indexing

2003-10-29 Thread Gerret Apelt
Peter Keegan wrote:

Is there a simple and efficient way of determining the number of tokens added
to a document after adding each field ('Document.add'), as a result of the actions
of the Analyzer, without having to re-parse the field?

Peter --

you can ask the Document instance.

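import java.util.Enumeration;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = getDocumentInstanceFromSomewhere();
int termCount = 0;
// walk every field of the document and add up what getValues() returns
Enumeration fields = doc.fields();
while (fields.hasMoreElements()) {
   Field field = (Field)fields.nextElement();
   String fieldName = field.name();
   String[] fieldTerms = doc.getValues(fieldName);
   termCount += fieldTerms.length;
}
System.out.println("The fields of the document together contain "
   + termCount + " terms.");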

Note that
1) I haven't tried to compile this code, so I'm not sure if it works
2) this will only work for those fields where field.isStored() == true. 
If the field isn't stored in the index, then you don't have a choice but 
to go back to the document.

[not sure on the following, so please correct me if in error:] Remember 
that unStored fields are indexed, so you can query on them, but the 
field terms themselves are not stored in the index. Therefore you cannot 
count them by asking Lucene. A Lucene field instance also has no way to 
reference the source of the terms that are added to it. The field 
doesn't care where its terms came from. So if field.isStored() == false, 
then for that particular field Lucene cannot tell you how many terms are 
in it. You'll have to write your own code that analyzes the original 
data source in this case.

Alternatively, is there a way to determine the number of tokens added after
adding the document to the index ('IndexWriter.addDocument')?
 

Whether you want the termCount for a document before or after you add 
the document to the index doesn't matter, so the answer is "see above".

cheers,
Gerret