inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Joshua O'Madadhain
Incorporating inter-term correlation into Lucene isn't that hard; I've 
done it.  Nor is it incompatible with the vector-space model.  I'm not 
happy with the specific correlation metric that I picked, which is why 
I'm not eager to generally release the code I wrote, but I think that 
the basic mechanism that I came up with (query expansion via correlated 
terms, where the added terms were boosted according to the strength of 
the correlation) is fairly sound.  And I didn't need any changes to 
Lucene to do this.
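As a minimal sketch of the idea (the correlation table, terms, and weights below are invented for illustration; my actual correlation metric is different):

```java
import java.util.*;

// Hypothetical sketch of the mechanism described above, without touching
// Lucene internals: expand a query with correlated terms, weighting each
// added term by its correlation strength. The correlation table here is
// invented for illustration.
public class QueryExpansion {
    // term -> (correlated term -> correlation strength in [0,1])
    static final Map<String, Map<String, Double>> CORR = Map.of(
        "car", Map.of("automobile", 0.9, "vehicle", 0.6));

    static Map<String, Double> expand(List<String> queryTerms) {
        Map<String, Double> weighted = new LinkedHashMap<>();
        for (String t : queryTerms) {
            weighted.put(t, 1.0);   // original terms keep full weight
            for (Map.Entry<String, Double> e :
                     CORR.getOrDefault(t, Map.of()).entrySet())
                weighted.merge(e.getKey(), e.getValue(), Math::max);
        }
        return weighted;
    }

    public static void main(String[] args) {
        System.out.println(expand(List.of("car", "rental")));
    }
}
```

In Lucene proper, each weighted expansion term would then become an additional optional clause on the query, with the weight applied as the clause's boost.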

You can get some details on the specific mechanism that I used here, if 
you're interested:

http://www.ics.uci.edu/~jmadden/research/index.html

(and scroll down to "Fuzzy Term Expansion and Document Reweighting", about 
halfway down.)

If you decide that my ideas are interesting enough that you want to 
have a look at my code, let me know, and perhaps we can work something 
out.

Regards,

Joshua O'Madadhain

On Friday, Nov 14, 2003, at 09:52 US/Pacific, Chong, Herb wrote:

I don't know of any open source search engine that incorporates 
inter-term correlation. I have been looking into how to do this in 
Lucene and, so far, it's not been promising: the indexing engine and 
file format need to be changed. There are very few search engines 
that incorporate inter-term correlation in any mathematically and 
linguistically rigorous manner. I designed a couple, but they were all 
research experiments.

Are you familiar with the TREC automatic ad hoc track? My 
experiments with the TREC-5 to TREC-7 questions produced about a 0.05 to 
0.10 improvement in average precision by proper use of inter-term 
correlation. My project at the time was cancelled after TREC-7, and so 
there haven't been any new developments.

 [EMAIL PROTECTED] Per 
Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, 
Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill 
Watterson
My opinions are too rational and insightful to be those of any 
organization.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: inter-term correlation [was Re: Vector Space Model in Lucene?]

2003-11-14 Thread Joshua O'Madadhain
Not sure what you mean by "terms can't cross sentence boundaries."  If 
you're only using single-word terms, that's trivially true.  What is it 
that you're trying to achieve, exactly?  (Your comment makes it sound 
as though you simultaneously want and don't want sentence boundaries, 
so I'm confused.)

Joshua

On Friday, Nov 14, 2003, at 10:13 US/Pacific, Chong, Herb wrote:

If you didn't have to change the index, then you haven't got all the 
factors needed to do it well: terms can't cross sentence boundaries, 
and the index doesn't store sentence boundaries.

Herb...



Re: Document Clustering

2003-11-11 Thread Joshua O'Madadhain
On Tuesday, Nov 11, 2003, at 11:05 US/Pacific, Marcel Stor wrote:

Stefan Groschupf wrote:
Hi,
How is document clustering different/related to text categorization?
Clustering: you try to find your own categories and put documents that
match into them; you group all documents with minimal distance together.
Would I be correct to say that you have to define a distance threshold
parameter in order to decide when to build a new category for a certain
group?
Depends on the type of clustering algorithm.  Some clustering 
algorithms take the number of clusters as a parameter (in this case the 
algorithm may be run several times with different values, to determine 
the best value).  Other types of algorithms, such as hierarchical 
agglomerative clustering algorithms, work more as you suggest.
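For instance, a toy single-linkage agglomerative clusterer (hypothetical code; it clusters one-dimensional points for brevity, where real document clustering would use a distance over term vectors) stops merging once the closest pair of clusters is farther apart than the threshold, which is exactly where a new category would begin:

```java
import java.util.*;

// Minimal single-linkage agglomerative clustering sketch (invented for
// illustration, not from this thread): repeatedly merge the two closest
// clusters until the smallest inter-cluster distance exceeds the
// threshold parameter discussed above.
public class Agglomerative {
    static List<List<Double>> cluster(double[] points, double threshold) {
        List<List<Double>> clusters = new ArrayList<>();
        for (double p : points) clusters.add(new ArrayList<>(List.of(p)));
        while (clusters.size() > 1) {
            int bi = -1, bj = -1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = dist(clusters.get(i), clusters.get(j));
                    if (d < best) { best = d; bi = i; bj = j; }
                }
            if (best > threshold) break;   // too far apart: stop merging
            clusters.get(bi).addAll(clusters.remove(bj));
        }
        return clusters;
    }

    // single-linkage distance: closest pair across the two clusters
    static double dist(List<Double> a, List<Double> b) {
        double m = Double.MAX_VALUE;
        for (double x : a) for (double y : b) m = Math.min(m, Math.abs(x - y));
        return m;
    }

    public static void main(String[] args) {
        System.out.println(
            cluster(new double[]{1.0, 1.2, 5.0, 5.1, 9.0}, 1.0).size());  // -> 3
    }
}
```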

Regards,

Joshua O'Madadhain




Re: Empty phrase search

2002-12-17 Thread Joshua O'Madadhain
Minh:

Assuming that the fields aren't binary (and thus capable of having any
arbitrary string in them), you should be able to come up with a short
string that will never appear in the field ('emptystring', 'xyxyqu23',
etc.), even if there is no single character that would work as a marker.  
As an extra layer of insurance, you could throw out any documents whose
field only contained that string _as a substring_.

This may not be completely bulletproof, but it's pretty close.  :)

Joshua


On Tue, 17 Dec 2002, Minh Kama Yie wrote:

 Yep, I thought of that, and having argued it out with the lead on this, it
 would be suitable in my opinion. But even I would concede that it is a hack
 in our case, since there is no limit to the variety of characters that could
 appear as the value for a field which may be an empty string. Hence there
 isn't a 100% guarantee that the special character or string of characters we
 choose to represent empty fields will never appear as a valid value.
 
 Thanks anyway though.
 
 Minh
 

Re: Empty phrase search

2002-12-16 Thread Joshua O'Madadhain
Minh:

Why not just use a special character or string (one that won't appear in a
non-empty field) to represent an empty field?  It doesn't have to appear
in the user interface, if that's a concern; you could convert a query
including 'FieldA:NULL' (or whatever) into a query containing
'FieldA:emptyfield' automatically.

This allows you to finesse the entire issue of adding something to
Lucene--which may be for the best anyway, since this is really just a
special case of looking for fields whose contents have a specific
characteristic.
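A sketch of that rewrite (the sentinel string and field names below are invented; pick anything that cannot occur in your data):

```java
// Sketch of the sentinel idea above: store a placeholder token for empty
// fields at index time, and rewrite the user-facing FieldA:NULL syntax
// into the internal sentinel before the query reaches Lucene. The
// sentinel value is assumed never to occur as real field data.
public class EmptyFieldSentinel {
    static final String SENTINEL = "xemptyfieldx";

    // value to actually index for a given raw field value
    static String indexValue(String raw) {
        return (raw == null || raw.isEmpty()) ? SENTINEL : raw;
    }

    // rewrite FieldA:NULL -> FieldA:xemptyfieldx for every field
    static String rewriteQuery(String userQuery) {
        return userQuery.replaceAll("(\\w+):NULL", "$1:" + SENTINEL);
    }

    public static void main(String[] args) {
        System.out.println(rewriteQuery("title:NULL AND body:lucene"));
        // -> title:xemptyfieldx AND body:lucene
    }
}
```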

Good luck--

Joshua O'Madadhain


On Tue, 17 Dec 2002, Minh Kama Yie wrote:

 Thanks for that, Peter.
 
 Unfortunately I'm not looking for all documents, but rather documents where
 the fields can be empty, hence using the universal field wouldn't quite work.
 The 'empty:true' approach is interesting; however, it effectively doubles the
 number of indexable fields for a document.
 
 We need to assess whether or not we want to support this feature, I guess.
 
 Currently, considering Lucene's architecture, I think this feature needs to be
 inherently supported rather than worked around with various fields
 for it to be used/done properly...
 
 Would anyone be able to point me in the general direction of where to look
 in the Lucene code to attempt this?
 Hopefully this way I can give a proper cost/benefit analysis as to whether
 or not to support this feature... and if all goes well, release something
 back for the great work you guys do.
 
 
 - Original Message -
 From: Peter Carlson [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Tuesday, December 17, 2002 4:25 PM
 Subject: Re: Empty phrase search
 
 
  I don't think so.
 
  One approach to look for everything, or for "not something", is to add a
  field to each document with a constant value, like a field named 'exists'
  with a value of 'true'.
 
  Then you can do a search like
 
  exists:true NOT microsoft
 
  This will find all documents without the term 'microsoft' in them.
 
  Just having it find documents with nothing might be a little tricky.
  You might want to put a field in the document which indicates the size,
  or something like that. Or just create an empty field and look for
 
  empty:true
 
  I hope this rambling helps.
 
  --Peter
 
  On Monday, December 16, 2002, at 03:24 PM, Minh Kama Yie wrote:
 
   Hi guys,
  
   Just wondering if lucene indexes empty strings and if so, how to
   search for this using the query language?
  
  
   Regards,
  
   Minh Kama Yie
  
 
 
  --
  To unsubscribe, e-mail:
 mailto:[EMAIL PROTECTED]
  For additional commands, e-mail:
 mailto:[EMAIL PROTECTED]
 
 




Re: Accentuated characters

2002-12-10 Thread Joshua O'Madadhain
On Tue, 10 Dec 2002, stephane vaucher wrote:

 I wish to implement a TokenFilter that will remove accentuated
 characters so for example 'é' will become 'e'. As I would rather not
 reinvent the wheel, I've tried to find something on the web and on the
 mailing lists. I saw a mention of a contrib that could do this (see
 http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg02146.html),
 but I don't see anything applicable.

It may depend on what kind of encoding you're working with.  (E.g., HTML
documents represent such characters in a different way than Postscript
documents do.)  Probably the easiest way to handle this, if you
want to avoid such questions, would be to convert all your input documents
(and query text) to Java (Unicode) strings, and then do a
search-and-replace with the appropriate character-pair arguments.  (After
this is done, you would then do whatever Lucene processing (indexing,
query parsing, etc.) was appropriate.)  I am not aware of any code that
does this, but it should be straightforward.
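For what it's worth, here is one way such a search-and-replace could look, using Unicode decomposition instead of an explicit character-pair table (this assumes java.text.Normalizer is available, which may not be true on older JDKs):

```java
import java.text.Normalizer;

// Accent stripping via Unicode decomposition: NFD splits an accented
// character like 'é' into 'e' plus a combining mark (U+0301), and the
// regex then removes all combining marks.
public class AccentStripper {
    static String strip(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("déjà vu"));   // -> deja vu
    }
}
```

Run the input through this before analysis (and do the same to query text) so that indexed and queried terms agree.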
 
Good luck--

Joshua O'Madadhain









Re: Lucene Speed under diff JVMs

2002-12-05 Thread Joshua O'Madadhain
On Thu, 5 Dec 2002, Armbrust, Daniel C. wrote:

 I'm using the class that Otis wrote (see message from about 3 weeks ago)
 for testing the scalability of lucene (more results on that later) and I
 first tried running it under different versions of Java, to see where it
 runs the fastest.  The class simply creates an index out of randomly
 generated documents.

 All of the following were running on a dual CPU 1 GHz PIII Windows 2000
 machine that wasn't doing much else during the benchmark.  The indexing
 program was single threaded, so it only used one of the processors of
 the machine.

[snip specific measurements]

 As you can see, the IBM JVM pretty much smoked Sun's, and beat out
 JRockit as well.  Just a hunch, but it wouldn't surprise me if search
 times were also faster under the IBM JDK.  Has anyone else come to this
 conclusion?

Just a brief note on performance measurements and statistical sampling: no
offense, but if these are measurements of a single trial of 1000 documents
for each JVM, they're not so different that I'd be willing to conclude
that one JVM is notably faster for this task than another.  The problem is
compounded by the fact that it can be hard to tell just how much CPU is
being taken up by OS tasks (and this can fluctuate quite a lot).  If you
really want to quote statistics like this, using 5 or 10 trials would give
a more accurate notion of the real performance differences (if any).
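A minimal harness for that kind of multi-trial measurement might look like this (the workload below is a stand-in, not Otis's indexing class):

```java
// Repeat the task several times, then report mean and standard deviation
// so that a single noisy run doesn't drive the conclusion.
public class Bench {
    // returns {mean, sample standard deviation}
    static double[] meanAndStdDev(double[] xs) {
        double mean = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        double var = 0;
        for (double x : xs) var += (x - mean) * (x - mean);
        return new double[]{mean, Math.sqrt(var / (xs.length - 1))};
    }

    public static void main(String[] args) {
        int trials = 5;
        double[] millis = new double[trials];
        for (int t = 0; t < trials; t++) {
            long start = System.nanoTime();
            // --- stand-in workload; replace with the indexing run ---
            long acc = 0;
            for (int i = 0; i < 1_000_000; i++) acc += i;
            // --------------------------------------------------------
            millis[t] = (System.nanoTime() - start) / 1e6;
            if (acc == -1) System.out.println();  // keep loop from being optimized away
        }
        double[] ms = meanAndStdDev(millis);
        System.out.printf("mean=%.2f ms  stddev=%.2f ms%n", ms[0], ms[1]);
    }
}
```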

Casuistically :),

Joshua O'Madadhain










Re: StandardFilter that works for French

2002-11-21 Thread Joshua O'Madadhain
On Thu, 21 Nov 2002, Konrad Scherer wrote:

 In French you have six words (me, te, se, le/la, ne, de) where the 'e' is
 replaced with an apostrophe when the following word starts with a vowel.
 For example, 'me aider' becomes 'm'aider'. Currently Lucene indexes 'm'aider',
 's'aider', 'n'aider' as different words when in fact they should be analyzed as
 'me aider', 'se aider', 'ne aider', etc. So I modified StandardFilter to send
 back these words as two words; I had to add a one-Token buffer. I toyed
 with modifying StandardTokenizer.jj, but I was worried about unintended
 changes in behavior.

 This change will not affect English indexing. The only change I can think
 of is that a word like 'm'lord' would be indexed as 'me lord'. Still, it might
 be better to make a French package and add this to a French filter.

There are a number of contractions in English that could be affected if
you're using the apostrophe as a marker, e.g.: isn't, wouldn't, I'd, he's,
hasn't.  (Granted, these are often considered stop words.)  Thus, I think
that your idea of incorporating this change into a French filter, rather
than modifying Standard filter, is a good idea.
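As a plain-string sketch of the rewrite (Konrad's actual patch works inside StandardFilter with a one-Token buffer; this just illustrates the mapping):

```java
import java.util.*;

// Split a French elided form like "m'aider" into the full pronoun plus
// the following word. English contractions pass through untouched
// because their prefixes ("isn", "wouldn", ...) are not elided pronouns.
public class FrenchElision {
    // elided prefix -> full form, for the six words mentioned
    static final Map<String, String> ELISIONS = Map.of(
        "m", "me", "t", "te", "s", "se", "l", "le", "n", "ne", "d", "de");

    static List<String> expand(String token) {
        int apos = token.indexOf('\'');
        if (apos > 0) {
            String prefix = token.substring(0, apos).toLowerCase();
            if (ELISIONS.containsKey(prefix))
                return List.of(ELISIONS.get(prefix), token.substring(apos + 1));
        }
        return List.of(token);
    }

    public static void main(String[] args) {
        System.out.println(expand("m'aider"));   // -> [me, aider]
        System.out.println(expand("isn't"));     // -> [isn't], left alone
    }
}
```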

Joshua O'Madadhain









Re: Tags Screwing up Searches

2002-10-21 Thread Joshua O'Madadhain
On Mon, 21 Oct 2002, Terry Steichen wrote:

 I discovered that the actual text that I was dealing with had already
 converted the '<' to '&lt;', and so forth.  So the problem
 is that with something like '&lt;b&gt;College Soccer&lt;/b&gt;',
 Lucene recognizes the trailing semi-colon ';' as a word separator, so
 it can find the term 'college', but it does not see the ending of
 'soccer'.  I did confirm that it *will* match on 'soccer&lt;' just
 fine.
 
 I've proceeded to add a string substitution method which replaces
 '&lt;' with '    ' (four spaces, in order to hopefully keep the offsets
 straight). It appears to work, though I believe it slows down the
 indexing.
 
 I don't know enough about the inner design of Lucene to figure this
 out, but it seems logical that there would be a much more efficient
 way to handle this than string operations.
 
 PS: I've had no responses from the list, so perhaps this is a unique
 problem and doesn't justify a formal fix effort.

A few questions and comments; please pardon me if I am asking questions
answered in previous email:

(1) Are you using an analyzer that is designed to handle (a) HTML, or
(b) plain text?

(2) If (b), that's probably why you've been getting this kind of behavior,
and you may want to look at the HTMLParser sample code in the
distribution.  The StandardAnalyzer, I'm pretty sure, is not designed to
handle HTML.

(3) A quick and dirty solution for indexing HTML if you are running on
some flavor of Unix and don't want to figure out how to do parse HTML
tags: the text web browser lynx.  lynx can 'dump' the text from a web
page out as follows:

cat foo.html | lynx -dump -nolist > foo.txt

This effectively strips the HTML tags out of foo.html and writes the text
of the page to the file foo.txt.

Once you've done this, of course, you can use the same analyzers that you
use for any unformatted text file.
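Alternatively, if you only need to neutralize escaped entities while keeping term offsets intact (the equal-length substitution Terry described), a sketch like this would do it (the entity list here is abbreviated):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Replace each escaped entity with a same-length run of spaces so that
// term offsets into the original text stay valid.
public class EntityBlanker {
    static final Pattern ENTITY = Pattern.compile("&(lt|gt|amp|quot);");

    static String blankEntities(String s) {
        StringBuilder sb = new StringBuilder(s);
        Matcher m = ENTITY.matcher(s);
        while (m.find())
            for (int i = m.start(); i < m.end(); i++) sb.setCharAt(i, ' ');
        return sb.toString();
    }

    public static void main(String[] args) {
        String in = "&lt;b&gt;College Soccer&lt;/b&gt;";
        String out = blankEntities(in);
        System.out.println(out.length() == in.length());  // -> true, offsets preserved
    }
}
```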

Good luck--

Joshua O'Madadhain










Re: Query modifier ?

2002-09-27 Thread Joshua O'Madadhain

On Fri, 27 Sep 2002, Ganael LAPLANCHE wrote:

 Does anyone know how to implement a query modifier with Lucene ? I
 would like to be able to use a dictionary to modify an invalid query
 (just as Google does) and to suggest a new query to the user... This
 isn't really a Lucene-related question, but if someone could help
 me...

The easiest way--or, at least, the method I used--is to do your own query
parsing: 

(1) take the input from the user as a String

(2) do whatever analysis or modifications on it that you like
(e.g., breaking it down into words using a StringTokenizer and checking
each word, by your favorite method, for misspellings)

(3) optionally let user review your modifications (return to (1)) 

(4) create a Term from each word from the modified query

(5) create a Query from your collection of Terms

That's the basic idea.  The details will depend on what kind of query you
want to do, and whether you want to allow the user to specify Boolean
modifiers, term boosts, etc.

It may be possible to use the standard QueryParser to parse the query and
then hack the Query that is returned, but I've never tried it.
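A toy version of steps (1)-(5), with the Term/Query construction replaced by plain strings and an invented dictionary, might look like:

```java
import java.util.*;

// Tokenize the raw query, look each word up in a small dictionary, and
// propose a corrected query using edit distance. The dictionary and the
// distance cutoff are invented for illustration.
public class QueryCorrector {
    static final Set<String> DICT = Set.of("lucene", "index", "search", "query");

    static String suggest(String raw) {
        StringBuilder out = new StringBuilder();
        for (StringTokenizer st = new StringTokenizer(raw); st.hasMoreTokens(); ) {
            String w = st.nextToken().toLowerCase();
            if (out.length() > 0) out.append(' ');
            out.append(DICT.contains(w) ? w : closest(w));
        }
        return out.toString();
    }

    // nearest dictionary word, only if within edit distance 2
    static String closest(String w) {
        String best = w;
        int bestD = 3;
        for (String d : DICT) {
            int dist = editDistance(w, d);
            if (dist < bestD) { bestD = dist; best = d; }
        }
        return best;
    }

    // standard Levenshtein distance
    static int editDistance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                dp[i][j] = Math.min(
                    dp[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1),
                    Math.min(dp[i - 1][j], dp[i][j - 1]) + 1);
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(suggest("lucnee serch"));   // -> lucene search
    }
}
```

The suggested string would then go through step (4) and (5), i.e. Term and Query construction, once the user accepts it.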

Good luck--

Joshua O'Madadhain









Re: Comparing Intermedia and Lucene

2002-09-25 Thread Joshua O'Madadhain

On Wed, 25 Sep 2002, Peter Carlson wrote:

 I have used intermedia in the past and found that it had a few 
 advantages and disadvantages compared to Lucene.
 
 Advantages of Intermedia
 4) Supports term expansion and contraction
 
 Now items 3 and 4 of Intermedia advantages can be added as Features to 
 Lucene, but are not currently.

I'm not sure what you mean by "supports term expansion."  In the IR
community, there are many different mechanisms for term expansion; perhaps
in the database community the term is more restrictive.  My understanding
is that the only thing you need in order to be able to expand terms (or
queries) is to be able to modify queries, which you can certainly do under
Lucene.  (A research project of mine uses Lucene as the core for a search
engine that does query expansion, among other things.)  Yes, I had to
write some code to do this, but static term expansion, in which each term
is expanded to a fixed list of other terms (which may be
specifiable/modifiable by administrators or users), is pretty
straightforward to code (my project used a considerably more sophisticated
mechanism).

If what you mean is that there is no specific API for term expansion in
Lucene, that's true, but I'm not sure how much value such an API would add
to Lucene.

Joshua O'Madadhain









RE: GoogleQueryParser

2002-09-12 Thread Joshua O'Madadhain

This issue confused me somewhat when I first encountered it from the other
direction (looking at the inputs to, and behavior of, BooleanQuery).  

However, ultimately--as I understand it--what's going on here is that
while the AND, OR syntax suggests Boolean queries, the indexing and
searching engines use the vector model (which ranks documents by their
score on the query) rather than the Boolean model (which returns all
documents which satisfy the query).  It may be that the Boolean syntax is
more misleading than useful in some cases.

Part of the problem, too (as Peter pointed out), may be that the example
queries under discussion are not well-formed Boolean queries, so it's not
necessarily clear what their behavior *should* be.

Let's consider the example:

a b OR c

If I assume a default of AND, a AND (b OR c) seems plausible; so
does (a AND b) OR c; or perhaps (a AND c) OR (b AND c).

I believe that the vector model interpretation of this query may be this
(with AND default):

+a +b c

which means 

The more of the search terms 'a', 'b', 'c' that a document has, the
higher its score will be, all other things being equal; however, if a
document doesn't have 'a', or doesn't have 'b', give it score 0
regardless of any other factors.

This is probably most similar to (a AND c) OR (b AND c).

The Boolean interpretation of (a AND c) OR (b AND c) would be

Return all documents [in any order] that have *at least one of the
following* term sets: {'a', 'c'}; {'b', 'c'}.

which, if you think about it, gives you the same document set as the
vector model interpretation.
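To make the flattened semantics concrete, here is a toy matcher (scores are plain term counts, a stand-in for Lucene's real scoring):

```java
import java.util.*;

// A small model of the +/-/optional semantics discussed above: a
// document matches only if it has every '+' term and no '-' term;
// optional terms just raise its score.
public class PlusMinusQuery {
    static double score(Set<String> doc, List<String> required,
                        List<String> prohibited, List<String> optional) {
        for (String t : required)   if (!doc.contains(t)) return 0;
        for (String t : prohibited) if (doc.contains(t))  return 0;
        double s = required.size();               // all required terms matched
        for (String t : optional) if (doc.contains(t)) s++;
        return s;
    }

    public static void main(String[] args) {
        // the "+a +b c" example from the discussion
        List<String> req = List.of("a", "b"), pro = List.of(), opt = List.of("c");
        System.out.println(score(Set.of("a", "b", "c"), req, pro, opt)); // 3.0
        System.out.println(score(Set.of("a", "b"), req, pro, opt));      // 2.0
        System.out.println(score(Set.of("a", "c"), req, pro, opt));      // 0.0 (missing +b)
    }
}
```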


Personally, I avoid using the QueryParser entirely and just do my own
parsing and query construction.  Part of the reason for this is that my
code is doing term expansion and reweighting, but part of it is just that
I feel that I get more power and flexibility--and less opportunity for
ambiguity such as this--by doing my own parsing.  Your mileage may vary.  
:)

Regards,

Joshua O'Madadhain


On Thu, 12 Sep 2002, John Cwikla wrote:

 Actually to expand a little more after a little more digging, it
 appears that the AND/OR terms are being flattened into a list of
 + - or optional query terms that are used to remove/add results
 one after the other.
 
 In this sense, I think the AND, OR and NOT operators seems to have
 been an afterthought to the queryparser, since they cannot give the
 same results as +, - and optional.  AND, OR and NOT should use intersections
 and unions, while + and - is doing strict adds or rejections.
 
 I guess I was expecting a syntax tree, but it looks like just flattening
 of terms.
 
 cwikla
 
 -Original Message-
 From: John Cwikla [mailto:[EMAIL PROTECTED]]
 Sent: Thursday, September 12, 2002 11:53 AM
 To: 'Lucene Users List'
 Subject: RE: GoogleQueryParser
 
 
 So here is the problem, with AND as the default operator:
 
 a b OR c == a AND b OR c == c OR (a AND b) == c OR (+a +b) != c b +a != c +b +a
 
 However, with the default OR operator:
 
 a AND b c == a AND b OR c == c OR (a AND b) == c OR (+a +b) == c (+a +b)
 
 Since AND and OR do not actually mean required or optional in a strict
 boolean sense, I claim you cannot correctly use the query parser with
 a default AND operator and get results that would be expected.
 
 I haven't looked more into the QueryParser yet, but in the last case with
 the
 AND operator, if at some point the internal query switched to OR, then
 the last item would be correct if it had parenthesis like c (+a +b)
 
 Or is it early and I'm missing something?
 
 cwikla
 
 
 
 -Original Message-
 From: Halácsy Péter [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, September 11, 2002 11:58 PM
 To: Lucene Users List
 Subject: RE: GoogleQueryParser
 
 -Original Message-
 From: Philip Chan [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, September 11, 2002 11:04 PM
 To: Lucene Users List
 Subject: RE: GoogleQueryParser
 
 
 I think there's a bug: if I set the default operator to be OR,
 when I run
 
 java org.apache.lucene.queryParser.QueryParser "a AND b OR c"
 
 it will give me the result of +a +b c.
 If I set the default operator to be AND, and run it with the
 term "a b OR c", it will give me +a b c, which is different.
 
 To be exact, it's not a bug, it's a feature ;) Well, the structured query
 language of Lucene (and Google and others) is not a strict boolean language.
 For example, I think the QueryParser of Lucene does not support parentheses: a
 AND (b OR c)
 
 Instead of strict boolean logic it supports constraints on query terms: a
 query term is either required, optional, or prohibited. If you write a + sign
 before the term it will be required. If you write -, it will be prohibited.
 The question is: is a term

Re: Newbie quizzes further...

2002-09-02 Thread Joshua O'Madadhain

On Mon, 2 Sep 2002, Stone, Timothy wrote:

 I have noted that Lucene fails to interpret numerous HTML entities,
 specifically entities in the 82xx range, e.g. '&#8212;' (em-dash) and
 many others. Now this may not be a Lucene issue (I'm looking at the
 code as I post), but I'm curious about its origins and why these entities
 don't seem to be parsed properly in the index.

As I see it, there are two answers to this question.

(1) What gets parsed and indexed is your choice; there are various
different Analyzers that are included with the Lucene package, which have
different effects.  You could conceivably construct an Analyzer that would
parse such entities as you describe.

(2) Historically punctuation has not been parsed by search engines, for
the simple reason that it doesn't tend to add much to search precision and
it complicates the indexing process.  (On the other hand, if you're
talking about accents and non-English letters, I understand that some
people have written analyzers that cover these things; check out the
contrib section on the Lucene website.)

Joshua O'Madadhain









Re: text format and scoring

2002-08-02 Thread Joshua O'Madadhain

On Sat, 3 Aug 2002, petite_abeille wrote:

 I was wondering what would be a good way to incorporate text-format 
 information into Lucene word/document scoring. For example, when turning 
 HTML into plain text for indexing purposes, a lot of potentially useful 
 information is lost: tags like 'bold', 'strong' and so on could be 
 understood as conveying emphasis information about some words. If 
 somebody took the pain to underline some words, why throw that away? 
 Assuming there is some interesting meaning in a document's format/layout, 
 and a way to understand and weight it, how could one incorporate this 
 information into document scoring?

If you can boost terms as they are indexed (I can't remember if this is
possible, but you can certainly do so on queries) then that might be a
good way of doing it; it's not so much a matter of changing document
scores (on the back end, with respect to a particular query) as it is of
changing the weighting of terms (on the front end).

I've just glanced through the API and I don't see a way to do term
boosting during indexing, but maybe there's something I've missed.  
Anyone?
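One API-independent workaround, if index-time boosting turns out not to be available: repeat emphasized text in the indexed field so that its term frequency (and hence its score contribution) is inflated. This is a sketch under assumptions of my own (the tag set and the weight are arbitrary), not a Lucene feature.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmphasisWeighter {
    // matches <b>...</b>, <strong>...</strong>, <em>...</em> (non-nested)
    private static final Pattern EMPHASIS =
        Pattern.compile("<(b|strong|em)>(.*?)</\\1>", Pattern.CASE_INSENSITIVE);

    /** Returns the text with each emphasized span repeated 'weight' times. */
    public static String weight(String html, int weight) {
        StringBuffer sb = new StringBuffer();
        Matcher m = EMPHASIS.matcher(html);
        while (m.find()) {
            StringBuilder repeated = new StringBuilder();
            for (int i = 0; i < weight; i++) {
                repeated.append(m.group(2)).append(' ');
            }
            m.appendReplacement(sb,
                Matcher.quoteReplacement(repeated.toString().trim()));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

The crude part is that term frequency typically enters the score through a square root, so repeating a term k times does not multiply its contribution by k; but it does push emphasized words up relative to plain ones without touching the indexing code.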

Regards,

Joshua O'Madadhain









Re: Wrong spelling

2002-07-24 Thread Joshua O'Madadhain

On Wed, 24 Jul 2002, Olivier Amira wrote:

 I would like to implement in my Lucene application a feature like
 Google's "Did you mean" feature: when the user enters a wrong spelling
 of a word, the search engine automatically proposes a similar, better
 word. To implement such a function in a Lucene application, I'm not
 sure what method is best (or whether it's correct to try to do this
 with a Lucene index). Is there anybody who could help me with this?

There are a couple of different approaches to this that I'm aware of.

(1) Find a list of commonly misspelled words, detect them in a query, and
prompt the user with the corresponding correctly spelled words.  Such
lists are pretty common.  Advantages: reasonably easy to implement,
computationally cheap, and most of the work (figuring out what words to
flag and what words to suggest in their place) is done statically.  
Disadvantages: it will catch 'speling' mistakes but not 'spellling'
mistakes (that is, it will only recognize errors that you tell it about).
This is entirely independent of the index unless you go to the trouble of
removing entries from this auxiliary data structure that correspond to
words that aren't in the index anyway.

(2) There's something in the Lucene API docs about a FuzzyQuery that
mentions Levenshtein distance (= string edit distance, I believe).  I
haven't looked into this myself, but I would guess that you should be able
to construct a FuzzyQuery that specifies a maximum string edit distance
between a specified search term and other terms in the index.  
Unfortunately, the API docs are just about that helpful; FuzzyTermEnum has
more information but doesn't tell you how to use FuzzyQuery.  On the other
hand at least you now know where to look in the source code.  :)
Advantages: more flexible, seems like it's built in; disadvantages: docs
not helpful, will probably slow your query down more than (1) would.

You could also try to write your own string edit distance calculator/data
structure, but I don't have any quick answers as to how to do that.
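For what it's worth, the standard dynamic-programming version of the string edit distance calculator mentioned above is short enough to write directly:

```java
public class EditDistance {
    /** Levenshtein distance: minimum number of single-character
     *  insertions, deletions, and substitutions turning a into b. */
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // all deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // all insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int subst = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // delete
                                            d[i][j - 1] + 1),  // insert
                                   d[i - 1][j - 1] + subst);   // substitute
            }
        }
        return d[a.length()][b.length()];
    }
}
```

With this you could, for instance, enumerate the terms in the index and suggest those within distance 1 or 2 of the misspelled query term; how to do that enumeration efficiently is a separate question.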

Good luck--

Joshua O'Madadhain 










RE: contains

2002-07-22 Thread Joshua O'Madadhain

On Tue, 16 Jul 2002, Lothar Simon wrote:

 Just to correct a few points:
 - The factor would be 2 * (average no of chars per word)/2 = (average
 no of chars per word).

Actually, I made a mistake in my earlier analysis, but your factor is also
inaccurate.  Ignoring other factors and overhead, storing just the set of
strings S = {s_1, s_2, ..., s_n} with corresponding lengths L = {l_1, l_2,
..., l_n} requires

(sum_i l_i) [sum of all lengths in L]

= (mean(L) * n) space
[where mean(L) = the above sum divided by the size of L (n)].


Storing all prefixes of each string requires

(sum_i sum_j j) [where j varies from 1 to l_i]

= (sum_i (l_i(l_i + 1))/2)

≈ (1/2)(sum_i (l_i^2))

≈ (sum_i (l_i^2)) for both prefixes and suffixes together

!= (l_i * sum_i l_i) [because you can't take the extra factor of l_i 
outside of the sum]

The mistake that you and I both made is this: the sum of the squares of
the lengths is not the same as n times the square of the mean length.
  
E.g.: L = {2,3,4,6,7,8} [mean 5, n = 6]
sum_i l_i = 30
2 * (sum_i sum_j j) = 208; 208/30 ≈ 6.93 [correct multiplicative factor]
2 * n * (mean(L)(mean(L) + 1)/2) = 180; 180/30 = 6 [this is *wrong*]
and the original hypothesis was that the additional factor should have
been the mean, 5.

Some additional calculations suggest that if you assume an exponential
distribution of word lengths (cf. Zipf's law), then basing your estimate
on the mean word length will cause you to underestimate by something
like 17%.

This is all just a fancy way of demonstrating that having long strings
hurts you more than having short ones helps you, i.e., using a mean value
in place of the sum is inaccurate in general.
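The general point (a mean-based estimate undercounts whenever the lengths are spread out) can be checked directly; the length set below is the one from the example above.

```java
public class PrefixSpace {
    /** sum_i (l_i^2): the term that actually governs the space cost. */
    public static int sumOfSquares(int[] lengths) {
        int total = 0;
        for (int l : lengths) total += l * l;
        return total;
    }

    /** n * mean(L)^2: the mean-based estimate of the same quantity. */
    public static double meanBasedEstimate(int[] lengths) {
        double mean = 0;
        for (int l : lengths) mean += l;
        mean /= lengths.length;
        return lengths.length * mean * mean;
    }
}
```

For L = {2,3,4,6,7,8} the true sum of squares is 178 against a mean-based estimate of 150, i.e., the long words cost more than the short ones save.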

 - One would probably create a set of 2 * (maximum number of chars per word)
 Fields for a document. Whether this could work was actually my question...
 - Most important: my proposal is exactly (and almost only) designed to solve
 the substring (*uti*) problem!!! One field in the first group of fields
 in my example contains "utiful" and would be found by "uti*"; a field in the
 other group of fields contains "itueb" and would be found by "itu*". Voila!

You are correct; I goofed.
 
 I still think my idea would work (given you spend the space for the index).

I still don't see how you deal with the problem that I mentioned before:

'Another problem with this is that in order to be able to get from "ful"
to "beautiful", you have to store, in the index entry for "ful", (pointers
to) every single complete word in your document set that contains "ful" as
a prefix or suffix.  Just _creating_ such an index would be extremely
time-consuming even with clever data structures, and consider how much
extra storage for pointers would be necessary for entries like "e" or
"n".'

In any case, I personally would consider the expected overhead of space to
be prohibitive.  However, so long as you address the remaining issue I
just mentioned, yes, I think that your scheme would work.

Regards,

Joshua O'Madadhain







RE: contains

2002-07-12 Thread Joshua O'Madadhain

On Fri, 12 Jul 2002, Lothar Simon wrote:

[in response to Peter Carlson pointing out that searching for *xyz* is a
difficult problem]
 Of course you are right. And I am surely more the last than the first
 one to try to come up with THE solution for this. But still... could
 the following work?
 
 If space (ok, a lot) is available you could store "beutiful",
 "eutiful", "utiful", "tiful", "iful", "ful", "ul", "l" PLUS their
 inversions ("lufitueb", "ufitueb", "fitueb", "itueb", "tueb", "ueb",
 "eb", "b") in the index. Space needed would be something like (average
 no of chars per word) times as much as in a normal index.

Actually it would be twice that, because you're storing backward and
forward versions.  I'd hazard a guess that this factor alone would mean
something like a 10- or 12-fold increase in index size (the average length
of a word is less than 5 or 6 letters, but by throwing out stop words you
throw out a lot of the words that drag the average down).

Another problem with this is that in order to be able to get from "ful" to
"beautiful", you have to store, in the index entry for "ful", (pointers
to) every single complete word in your document set that contains "ful" as
a substring.  Just _creating_ such an index would be extremely
time-consuming even with clever data structures, and consider how much
extra storage for pointers would be necessary for entries like "e" or "n".

Finally, you're not including all substrings: your scheme doesn't allow me
to search for *uti* and find "beautiful".  If you did, the number of
entries would then be multiplied by a factor of the _square_ of the
average number of characters per word.  (You might be able to avoid this
by doing prefix and suffix searches--which are difficult but less so--on
the strings you specify, though.)
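To make the storage blowup concrete, here is a sketch of the entry set the proposal generates for one word: every suffix of the word plus every suffix of its reversal, so that a leading-wildcard search *xyz becomes a prefix search over the forward-suffix entries. (The word "beutiful" is spelled as in the quoted proposal, to keep its entries comparable.)

```java
import java.util.ArrayList;
import java.util.List;

public class SubstringEntries {
    /** All index entries the scheme stores for one word:
     *  suffixes of the word plus suffixes of its reversal (2 * length). */
    public static List<String> entries(String word) {
        List<String> result = new ArrayList<String>();
        String reversed = new StringBuilder(word).reverse().toString();
        for (int i = 0; i < word.length(); i++) {
            result.add(word.substring(i));      // e.g. "utiful" for "beutiful"
            result.add(reversed.substring(i));  // e.g. "itueb"
        }
        return result;
    }
}
```

So an l-letter word contributes 2l entries, which is where the (average number of characters per word) multiplier on index size comes from.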

There might be some clever way to get around these problems, but I suspect
that developing one would be a dissertation topic.  :)

Regards,

Joshua O'Madadhain
 








Re: contains

2002-07-10 Thread Joshua O'Madadhain

Pradeep:

I think what Peter was trying to get at was the question: when is it
useful for a search engine user to be able to search for words that
contain a particular letter?

For a language like Chinese, it would certainly be useful to be able to
search for a single character.  However, the informational content of a
single letter in an alphabet-based language (such as English) is so low
that I have trouble believing that it would be useful to be able to do
this kind of search.

That is to say: unless this feature has been presented to you as a
requirement, you may want to think about how it might be used in practice
before you spend a lot of time implementing it.

Regards,

Joshua O'Madadhain


On Thu, 11 Jul 2002, Pradeep Kumar K wrote:

 Hi Peter,
   I want to include an option called "contains" in my search application.
  For example: Name contains 'p', like that...
 Thanks for the reply.
 Pradeep
 
 Peter Carlson wrote:
 
 Do you really want to be able to find items by letter? Or do you have some
 other purpose that tokenizing by letter is trying to get around?
 
 If you do want to tokenize by letter, you can create your own tokenizer
 which breaks up items by letter. See the current tokenizers under
 org.apache.lucene.analysis.
 
 --Peter
 
 On 7/10/02 10:26 AM, Pradeep Kumar K [EMAIL PROTECTED] wrote:

 Is it possible to search for a word containing some letters?
 Example: "God is love"
 
 How can I create a query to search for sentences having "d"?
 I found that Lucene tokenizes a sentence into words, not into letters.
 Is it possible using Lucene? Can anybody give a clue for this?






Re: Combining queries using OR

2002-06-09 Thread Joshua O'Madadhain

On Sun, 9 Jun 2002, Pradeep Kumar K wrote:

How can I combine two or more queries using the 'OR' operator? For example:

 Two queries are

   Query quer1=QueryParser.parse(String query1,String field1,analyzer)
   Query quer2=QueryParser.parse(String query2,String field2,analyzer)

 How can I combine these two queries using 'OR' operator i.e. ultimately
 query should be one like

 quer1 OR quer2

Take a look at BooleanQuery.  You want to add quer1 and quer2 to a
BooleanQuery such that each is neither prohibited nor required (this will
make more sense once you see the interface).

Regards,

Joshua O'Madadhain








Re: Normalization of Documents

2002-04-16 Thread Joshua O'Madadhain

from Bernhard Messer:

  Let me know if you find that idea interesting; I would like to work on
  that topic.

Yup, me too.  This is germane to my research as well.

Joshua







RE: Googlifying lucene querys

2002-02-25 Thread Joshua O'Madadhain

You cannot, in general, structure a Lucene query such that it will yield
the same document rankings that Google would for that (query, document
set).  The reason for this is that Google employs a scoring algorithm that
includes information about the topology of the pages (i.e., how the
pages are linked together).  (An overview of what Google does in this
regard may be found at http://www.google.com/technology/index.html .)
Thus, in order to get Lucene to do what Google does, you'd have to
rewrite large chunks of it.

Joshua


On Mon, 25 Feb 2002, Spencer, Dave wrote:

 I'm pretty sure google gives priority to the words appearing in the
 title and URL.
 
 I believe sect 4.2.5 says this here:
 http://citeseer.nj.nec.com/cache/papers/cs/13017/http:zSzzSzwww-db.stanf
 ord.eduzSzpubzSzpaperszSzgoogle.pdf/brin98anatomy.pdf
 from here: 
 http://citeseer.nj.nec.com/brin98anatomy.html
 
 So you have to have Lucene store the title as a separate field.
 
 This is then what you'd have if, like me, you boost (the caret denotes the
 boost) the title by 5 and the URL by 2:
 
 +(title:george^5.0 url:george^2.0 contents:george) +(title:bush^5.0
 url:bush^2.0 contents:bush) +(title:white^5.0 url:white^2.0
 contents:white) +(title:house^5.0 url:house^2.0 contents:house)
 
 
 -Original Message-
 From: Ian Lea [mailto:[EMAIL PROTECTED]]
 Sent: Saturday, February 23, 2002 8:15 AM
 To: Lucene Users List
 Subject: Re: Googlifying lucene querys
 
 
 +george +bush +white +house
 
 
 --
 Ian.
 
 Jari Aarniala wrote:
  
  Hello,
  
   Despite the confusing subject ;) my question is simple. I'm just
   trying out Lucene for the first time and would like to know how one
   would go about implementing search on the index with the same logic
   that Google uses.
   For example, if the user input is "george bush white house", how
   do I easily construct a query that searches for ALL of the words above?
   If I have understood correctly, passing the search string above to the
   QueryParser creates a query that searches for ANY of the words above.
  
  Thanks for any help,
 
 
 
 
 






Re: Lucene Query Structure

2002-02-19 Thread Joshua O'Madadhain

Actually, Winton's suggestion doesn't work because it's inconsistent with
the syntax of BooleanQuery() (the constructor doesn't take arguments, and
add() takes one Query argument, not two).

After considerable study of the documentation, I am still confused about
the semantics of BooleanQuery.  I think I can answer the original
syntactic question posed by sjb, but the overall motivation escapes me.

I believe the correct syntax is (given TermQueries a, b, c, and d):

// create a BooleanQuery for '(a AND b)'
BooleanQuery bq_ab = new BooleanQuery();
bq_ab.add(a, true, false);
bq_ab.add(b, true, false);

// same as above with c and d
BooleanQuery bq_cd = new BooleanQuery();
bq_cd.add(c, true, false);
bq_cd.add(d, true, false);

// join two BooleanQueries together
BooleanQuery bq_abcd = new BooleanQuery();
bq_abcd.add(bq_ab, false, false);
bq_abcd.add(bq_cd, false, false);


Now, as sjb pointed out, (query, false, false) doesn't really seem to
have the semantics of a boolean OR.  In particular:

(1) It's a unary operator: add() adds a Query (or a BooleanClause) to a
BooleanQuery.  OR is a binary operator.

(2) add(query, false, false) adds a query Q whose satisfaction is
*irrelevant* to that of the resultant BooleanQuery BQ: documents are
neither required to satisfy Q, nor required to *not* satisfy Q, in order
for BQ to be satisfied.  The semantics of a boolean OR should be that at
least *one* of the components (queries) must be satisfied in order for the
entire expression (composite query) to be satisfied.


I conclude that either 
(a) I simply don't understand the proper use of BooleanQuery, or 
(b) BooleanQuery cannot be used to express a boolean OR.  

If (b) is true, either the semantics of BooleanQuery need revising, or
it needs to get called something else so as not to be confusing.  

If the semantics of BooleanQuery are revised, I would suggest changing the
syntax as well to reflect the binary nature of Boolean operators:

BooleanQuery.add(Query q1, Query q2, boolean and)

which would equate to (q1 AND q2) if 'and' is true, otherwise (q1 OR q2).


If there are any papers or references which would explain this better,
that would be great.  Otherwise, I would really appreciate it if one of
the developers would take a few moments to clarify this issue.  I'm trying
to use Lucene as a platform for research in IR; to do this I need a clear
understanding of the exact definition of the scoring system, especially as
it relates to the semantics of queries involving multiple terms.

Once I get a clear understanding of this issue, I would be happy to write
it up and submit it as an addition to the FAQ/docs.

Thanks in advance for any assistance rendered in getting this sorted out.

Regards,

Joshua

On Mon, 18 Feb 2002, Winton Davies wrote:

 BQ(Term, Term, Include, Exclude)
 
 BQ(
   BQ(a,b,true,false)
   BQ(a,b,true,false)
   false,
   false)
 
 Should work...
   Winton

 Lets say I have two queries which I want to combine into one:
 
 (a and b) OR (c and d) 
 
 I would use QueryParser.parse to form the subqueries, but how do I/can I
 combine them with the OR logic?
 
 The BooleanQuery can be used to piece queries together but it does not
 support OR's correct?  (It only supports include, exclude, and 
 preferential.)
 
 Thanks for any help,
 
 sjb










Qs re: document scoring and semantics

2002-02-16 Thread Joshua O'Madadhain

(1) The FAQ states the following:

For the record, Lucene's scoring algorithm is, roughly:

  score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
boost_t) * coord_q_d
 
where:
  score_d   : score for document d
  sum_t : sum for all terms t
  tf_q  : the square root of the frequency of t in the query
  tf_d  : the square root of the frequency of t in d
  idf_t : log(numDocs/docFreq_t+1) + 1.0
  numDocs   : number of documents in index
  docFreq_t : number of documents containing t
  norm_q: sqrt(sum_t((tf_q*idf_t)^2))
  norm_d_t  : square root of number of tokens in d in the same field as t
  boost_t: the user-specified boost for term t
  coord_q_d  : number of terms in both query and document / number of
terms in query

Is either of the expressions below the correct parenthesization of the
expression above?  If not, what is?

score_d = sum_t(tf_q * (idf_t / norm_q) * tf_d * (idf_t / norm_d_t) *
boost_t) * coord_q_d

score_d = sum_t((tf_q * idf_t) / (norm_q * tf_d * idf_t) / (norm_d_t *
boost_t)) * coord_q_d


(2) I'm trying to make sure that I have a handle on the semantics of
BooleanQuery, especially as they relate to the scoring mechanism.  I would
appreciate it if someone would correct any misapprehensions in the
following descriptions.

* BooleanQuery.add(query, false, false) is equivalent to Boolean OR.  
Only one such query must be satisfied in a given document D (via the
appearance of the associated term(s)/phrase(s) in D) in order for the
score for D to be > 0.

* BooleanQuery.add(query, true, false) is equivalent to Boolean AND.  All
such queries must be satisfied (as above) in D in order for score(D) to
be > 0.

* BooleanQuery.add(query, false, true) is equivalent to Boolean NAND.  
All such queries must be *not* satisfied in D in order for score(D) to be
> 0.

If these, and the semantics for "required" and "prohibited" (in
BooleanQuery.add()), are accurate, then the semantics seem rather odd to
me, so I'm hoping that someone will tell me that I'm wrong.  :) In
particular, it seems to me that if you create a BooleanQuery and add a
single TermQuery tq to it with add(tq, false, false), then, according to
the semantics of "required" and "prohibited", *any* document will match
the query...which clearly doesn't make sense.
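To pin down what I mean, here is a toy model of one plausible reading of the matching semantics: every required clause must match, no prohibited clause may match, and when there are no required clauses at least one optional clause must match. This last condition is my assumption; it resolves the "matches everything" oddity above, but it is a guess about intended behavior, not a description of Lucene's code.

```java
public class BooleanModel {
    // Per-clause flags: required[i], prohibited[i]; matched[i] says whether
    // clause i's sub-query is satisfied by the document under test.
    public static boolean matches(boolean[] matched,
                                  boolean[] required,
                                  boolean[] prohibited) {
        boolean anyRequired = false, anyOptionalMatched = false;
        for (int i = 0; i < matched.length; i++) {
            if (prohibited[i] && matched[i]) return false;  // NOT violated
            if (required[i]) {
                anyRequired = true;
                if (!matched[i]) return false;              // AND violated
            } else if (!prohibited[i] && matched[i]) {
                anyOptionalMatched = true;                  // OR satisfied
            }
        }
        return anyRequired || anyOptionalMatched;
    }
}
```

Under this reading a single (false, false) clause behaves like a one-term OR (the document must match it), rather than matching every document.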


(3) Somewhat unrelated question: what are the semantics and purpose of
FilteredTermEnum.difference()?  (I see where and how it's used in the
source but I don't understand the motivation.)


(4) I'm still somewhat puzzled by MultiTermQuery.  I believe that Lucene
essentially uses the vector model to identify (and quantify)
query-document matching.  A vector model query consists of one or more
terms, so I would expect MultiTermQuery to be a common query
type.  However, the documentation doesn't make it clear how one binds
several terms together into a MultiTermQuery, and in particular I don't
see how one could separately set the boost for different terms in a
MultiTermQuery.  In my current prototype system, I've been using
BooleanQuery to bind my terms together into a single query, and I'm not
sure that the semantics for BooleanQuery is what I want (hence question
(2) above).  Could someone please explain what MultiTermQuery is for,  
how it should be used, etc.?

Thanks--

Joshua O'Madadhain








number of terms vs. number of fields

2001-12-01 Thread Joshua O'Madadhain

I have been experimenting with indexing a document set with different sets
of fields.  Specifically, I start out with a contents field that
is a concatenation of all the elements of the original document in which
I'm interested.  This gets me an index with about 7500 unique terms (which
I determine by opening up an IndexReader, extracting the terms in the
index, and counting them).  Then I've been adding each of the separate
elements (title, major subject, minor subject, abstract/extract), one at a
time, to the index (by recreating the index).  

Because "contents" is the concatenation of the other fields (title,
major, minor, abstract/extract), I would expect that the number of
unique terms in the index would not change if I added the other fields
to the index; each term should just have twice the frequency
as if I only used the "contents" field.  However, this is not what's
happening; in fact, if I add all the other fields in, the total number of
unique terms is 22000+.

I have verified that "contents" contains everything that the other fields
do, so I am quite puzzled by this.  Any idea what's going on here, and
why?

For anyone who might be interested in checking this out, my code is below.

Regards,

Joshua


// FileCFDocument.java (a modified version of FileDocument.java in the
// source examples)
import java.io.File;
import java.io.FileInputStream;
import java.util.Vector;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/** A utility for making Lucene Documents from a File. */

public class FileCFDocument
{
    public static Document[] makeDocuments(File f)
        throws java.io.FileNotFoundException, java.io.IOException
    {
        // open file, read it into a byte array and thence a String
        FileInputStream fis = new FileInputStream(f);
        int n = fis.available();
        byte[] data = new byte[n];
        fis.read(data);
        fis.close();

        String s = new String(data);
        int ti, so, mj, mn, ab, ex, rf, abex;

        Vector vdocs = new Vector();
        String contents;

        // fields being indexed:
        // TI (title)
        // MJ (major subject)
        // MN (minor subject)
        // AB/EX (abstract/extract)

        ti = s.indexOf("\nTI ");
        while (ti != -1)
        {
            // make a new, empty document
            Document doc = new Document();

            int k = s.indexOf("\nPN ");
            doc.add(Field.UnIndexed("number", s.substring(k+4, k+9)));

            // DEBUG
            System.out.println(s.substring(k, k+9));

            s = s.substring(ti+4);

            // DEBUG
            System.out.println("s.length(): " + s.length());

            so = s.indexOf("\nSO ");
            mj = s.indexOf("\nMJ ");
            mn = s.indexOf("\nMN ");
            ab = s.indexOf("\nAB ");
            ex = s.indexOf("\nEX ");
            rf = s.indexOf("\nRF ");

            //System.out.println("so: " + so + ", mj: " + mj + ", mn: " + mn +
            //    ", ab: " + ab + ", ex: " + ex + ", rf: " + rf);

            String title = s.substring(0, so);
            doc.add(Field.Text("title", title));
            contents = title;

            if (mj != -1 && mj < mn) // not all documents have major subject
            {
                String major = s.substring(mj+4, mn);
                //doc.add(Field.Text("major", major));
                contents = contents + " " + major;
            }

            if (ab != -1 && ab < rf) // if this document has an abstract
            {
                abex = ab;
                String abs = s.substring(ab+4, rf);
                //doc.add(Field.Text("abstract", abs));
                contents = contents + " " + abs;
            }
            else // it has an extract instead
            {
                abex = ex;
                String extract = s.substring(ex+4, rf);
                //doc.add(Field.Text("extract", extract));
                contents = contents + " " + extract;
            }

            if (mn != -1 && mn < abex)
            {
                String minor = s.substring(mn+4, abex);
                //doc.add(Field.Text("minor", minor));
                contents = contents + " " + minor;
            }

            // add a field that's the concatenation of the others so that
            // we can search on all fields simultaneously
            doc.add(Field.Text("contents", contents));

            // DEBUG
            //System.out.println(contents);
            System.out.println(doc.toString());

            ti = s.indexOf("\nTI ");
            vdocs.add(doc);
        }

        Document[] docs = new Document[vdocs.size()];
        vdocs.toArray(docs);

        return docs;
    }

    private FileCFDocument() {}
}

// IndexCFFiles.java (a modification of the IndexFiles.java example)
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.document.Document;

import java.io.File;
import java.util.Date;

// DEBUG
import