scoring algorithm

2003-09-23 Thread Chris Hennen
Hi,

what is the purpose of tf_q * idf_t / norm_q in Lucene's scoring 
algorithm:
score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)

I don't understand why the score has to be higher when the frequency of a
term in the query is higher. What is normalized by norm_q?

Thanks,
Chris

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


taxonomy with lucene

2003-09-23 Thread Gwo Haur, Fun
Hi,
Has anyone tried building taxonomies in Lucene? Any idea what is the
likely approach to be taken?

thanks




Re: Lucene demo ideas?

2003-09-23 Thread Steven J. Owens
On Wed, Sep 17, 2003 at 08:00:42AM -0400, Erik Hatcher wrote:
 I'm about to start some refactorings on the web application demo that 
 ships with Lucene to show off its features and be usable more easily 
 and cleanly out of the box - i.e. just drop into Tomcat's webapps 
 directory and go.
 
 Does anyone have any suggestions on what they'd like to see in the demo 
 app?

 One odd thought (may be out of scope) is to put together a
google-flavored query language, since most users are going to be
unfamiliar with the default Lucene query language.  Lucene doesn't
really match google, but something google-flavored might be better at
showing off Lucene's features in the demo.

-- 
Steven J. Owens
[EMAIL PROTECTED]

I'm going to make broad, sweeping generalizations and strong,
 declarative statements, because otherwise I'll be here all night and
 this document will be four times longer and much less fun to read.
 Take it all with a grain of salt. - Me at http://darksleep.com





Re: taxonomy with lucene

2003-09-23 Thread Eric Jain
 Has anyone tried building taxonomies in Lucene? Any idea what is the
 likely approach to be taken?

I'm storing data with a hierarchical classification in a Lucene index,
if that is what you mean.

The approach is very simple. Every document has a field for a unique
identifier, a field for the identifier of its immediate parent, and a
field for the identifiers of all its ancestors. This allows you to write
queries such as name:human ancestor:2759 to find organisms that have
"human" in their name and are Eukaryotes (but not, say, viruses).

This approach also allows you to efficiently display search results in a
tree, even for very large result sets, as long as the hierarchy doesn't
get too flat.

One drawback of this approach is that doing incremental updates is not
possible, or at least very complicated (duplicated information in the
ancestor field), and you must be careful about the order in which you
add documents to the index (parent before child).
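The scheme described above can be sketched without Lucene at all. The snippet below models the id/parent/ancestors fields with plain dicts; the documents and identifiers are invented for illustration (2759 is used for Eukaryota, as in the example above), and the string match stands in for the real name-field query.

```python
# Illustrative model of the ancestor-field scheme: each "document" carries
# its own id, its immediate parent, and the ids of all ancestors.
docs = [
    {"id": "2759", "parent": "131567", "ancestors": ["131567"],
     "name": "Eukaryota"},
    {"id": "9606", "parent": "9605", "ancestors": ["9605", "2759", "131567"],
     "name": "Homo sapiens (human)"},
    {"id": "11676", "parent": "11646", "ancestors": ["11646", "10239"],
     "name": "Human immunodeficiency virus 1"},
]

def search(name_term, ancestor_id):
    """Equivalent of the query name:<term> ancestor:<id>."""
    return [d for d in docs
            if name_term in d["name"].lower()
            and ancestor_id in d["ancestors"]]

# Only Homo sapiens matches: HIV-1 has "human" in its name but is not
# a descendant of Eukaryota (id 2759).
hits = search("human", "2759")
print([d["id"] for d in hits])
```

Because every document carries the full ancestor list, the filter is a single term match per ancestor rather than a recursive walk, which is why it stays cheap for large result sets.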

Let me know if you are interested in any further details.

--
Eric Jain





Proposition :adding minMergeDoc to IndexWriter

2003-09-23 Thread Julien Nioche
Hui,

Concerning another point on your request list: I proposed a patch this
weekend on the lucene-dev list and totally forgot that this feature was
requested on the user list.

This new feature lets you set the number of Documents to be merged in
memory independently of the mergeFactor.

Any comments would be appreciated

Best regards

Julien Nioche
http://www.lingway.com

-- Start of original message --

From:    fp235-5 [EMAIL PROTECTED]
To:      lucene-dev [EMAIL PROTECTED]
Cc:
Date:    Sat, 20 Sep 2003 16:06:06 +0200
Subject: [PATCH] IndexWriter: controlling the number of Docs merged

Hello,

Someone made a suggestion yesterday about adding a variable to IndexWriter
in order to control the number of Documents merged in RAMDirectory
independently of the mergeFactor. (I'm sorry, I don't remember who exactly;
the mail arrived at my office.)
I'm proposing a tiny modification of IndexWriter to add this functionality.
A variable minMergeDocs specifies the number of Documents to be merged in
memory before starting a new Segment. The mergeFactor still controls the
number of Segments created in the Directory, and thus it's possible to
avoid the file number limitation problem.

The diff file is attached.

As noticed by Dmitry and Erik, there are no true JUnit tests. I'd be OK
with writing a JUnit test for this feature. The problem is that the
SegmentInfos field is private in IndexWriter and can't be used to check
the number and size of the Segments. I ran a test using the infoStream
variable of IndexWriter - everything seems to be OK.

Any comments / suggestions are welcome.

Regards

Julien









- Original Message -
From: hui [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, September 22, 2003 3:40 PM
Subject: Re: per-field Analyzer (was Re: some requests)


 Good work, Erik.

 Hui

 - Original Message -
 From: Erik Hatcher [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Saturday, September 20, 2003 4:13 AM
 Subject: per-field Analyzer (was Re: some requests)


  On Friday, September 19, 2003, at 07:45  PM, Erik Hatcher wrote:
   On Friday, September 19, 2003, at 11:15  AM, hui wrote:
 1. Move the Analyzer down to field level from document level, so some
 fields can be given a special analyzer while other fields still use the
 default analyzer from the document level.
 For example, I do not need to index the numbers for the content field;
 it helps me reduce the index size a lot when I have some Excel files.
 But I always need the created_date to be indexed, though it is a number
 field.
  
 I know there are some workarounds posted in the group, but I think this
 would be a good feature to have.
  
 The workaround is to write a custom analyzer and have it do the
 desired thing per-field.
  
   Hmmm just thinking out loud here without knowing if this is
   possible, but could a generic wrapper Analyzer be written that
   allows other analyzers to be used under the covers based on a field
   name/analyzer mapping?   If so, that would be quite cool and save
   folks from having to write custom analyzers as much to handle this
   pretty typical use-case.  I'll look into this more in the very near
   future personally, but feel free to have a look at this yourself and
   see what you can come up with.
 
  What about something like this?
 
public class PerFieldWrapperAnalyzer extends Analyzer {
  private Analyzer defaultAnalyzer;
  private Map analyzerMap = new HashMap();

  public PerFieldWrapperAnalyzer(Analyzer defaultAnalyzer) {
    this.defaultAnalyzer = defaultAnalyzer;
  }

  public void addAnalyzer(String fieldName, Analyzer analyzer) {
    analyzerMap.put(fieldName, analyzer);
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    Analyzer analyzer = (Analyzer) analyzerMap.get(fieldName);
    if (analyzer == null) {
      analyzer = defaultAnalyzer;
    }
    return analyzer.tokenStream(fieldName, reader);
  }
}
 
This would allow you to construct a single analyzer out of others, on a
per-field basis, including a default one for any fields that do not have
a special one. Whether the constructor should take the map or the
addAnalyzer method is implemented is debatable, but I prefer the
addAnalyzer way. Maybe addAnalyzer could return 'this' so you could
chain: new PerFieldWrapperAnalyzer(new
StandardAnalyzer()).addAnalyzer("field1", new
WhitespaceAnalyzer()).addAnalyzer(...). And I'm more inclined to call
this thing PerFieldAnalyzerWrapper instead. Any naming suggestions?
 
  This simple little class would seem to be the answer to a very common
  question asked.
 
  Thoughts?  Should this be made part of the core?
 
  Erik
 
 

Limitation in size of Query class.

2003-09-23 Thread Cecilio Cano Calonge

Hi, all.

In the Query class and its subclasses, is there any limitation on size?

Thanks in advance.

- -- 
Cecilio Cano Calonge · Czy 
GNUpg Key = 5011 67C7 7C0B A513 C18F  D93B 071B BA7C 9DF6 9399
 





RE: Confusion over wildcard search logic

2003-09-23 Thread Dan Quaroni
 BTW, this is with lucene 1.2

Thanks!




Re: Confusion over wildcard search logic

2003-09-23 Thread Erik Hatcher
Ah, this is a fun one... lots of fiddly issues with how queries work
and how QueryParser works.  I'll take a stab at some of these inline
below...

On Monday, September 22, 2003, at 08:26  PM, Dan Quaroni wrote:
I have a simple command line interface for testing.
Interesting interface.  Looks like something that, if made generic
enough, would be handy to have at least in the sandbox.

  I'm getting some odd
results, though, with certain logic of wildcard searches.
Not all your queries are truly WildcardQuerys, though.  Look at the
class it constructed to get a better idea of what is happening.

  It seems like
depending on what order I put the fields of the query in alters the 
results
drastically when I AND them together.
Not quite the right explanation of what is happening.  More below

***
This one makes sense
Query name:amb*
State california
name:amb*
[EMAIL PROTECTED]
amb*
2819 total matching documents
Right... QueryParser does a little optimization here, and anything with
a simple trailing * turns into a PrefixQuery, matching all name fields
that begin with "amb".

***
This is the REALLY confusing one.  We know there's a company named AMB
Property Corporation.  Why do I get NO hits?
Query name:amb prop*
State california
name:amb prop*
[EMAIL PROTECTED]
amb prop
0 total matching documents
Notice you're now in PhraseQuery land.  Wildcards don't work like you
seem to expect here.  What is really happening is a query for documents
that have the terms "amb" and "prop" side by side, in that order.
The asterisk got axed by the analyzer.  If you said name:amb name:prop*
you'd get some hits, I believe, as it would turn into a BooleanQuery
with a term query and a wildcard query either OR'd or AND'd together.
PhraseQuery does not support wildcards.  A custom subclass of
QueryParser could do some interesting things here and expand
wildcard-like terms in a phrase into a PhrasePrefixQuery, but that is
probably overkill here (although maybe not).  Look at the test case for
PhrasePrefixQuery for some hints.

Ok, so I get some results with this (I know the * isn't necessary at
the end of property, but bear with me for the next example where it
goes all screwy)

Query name:amb property*
State california
name:amb property*
[EMAIL PROTECTED]
amb name:amb property*:property*
56 total matching documents
Your default field for QueryParser is property*?  Odd field name, or is
the output fishy?  I'm a bit confused by the "property*:" there.  I'm
assuming you're outputting the Query.toString() here.

See above for a different way to phrase the query.

***
south san francisco is an exact match to the city.  Why does this find 
0
results??!

Query name:amb property* AND city:south san francisco
State california
name:amb property* AND city:south san francisco
[EMAIL PROTECTED]
amb +name:amb property* AND city:south san francisco:property* 
+city:south
name:
amb property* AND city:south san francisco:san name:amb property* AND
city:south
 san francisco:francisco
0 total matching documents
With all the ANDs going on, this makes sense, because "san" and
"francisco" end up as separate term queries.  You'd have to quote the
phrase, city:"south san francisco", to turn it into a PhraseQuery.


Do this and suddenly I get matches
Query name:amb propert* and city:south san fran*
State california
name:amb propert* and city:south san fran*
[EMAIL PROTECTED]
amb name:amb propert* and city:south san fran*:propert* city:south 
san
fran56 total matching documents
You're getting hits on the wildcard match at least, and probably on the
name field "amb" as well.  Again, phrase queries don't support
wildcards the way you've used them here with "south san fran*", so
you're not matching anything with that.

*
And look, this gets matches too:
Query name:amb propert* and city:south san*
State california
name:amb propert* and city:south san*
[EMAIL PROTECTED]
amb propert city:south san
10732 total matching documents
My guess here is you're getting hits on "south san" as a phrase query.
Are there really that many in that area?

*
Yet do this and we're back to 0 results:
Query name:amb propert* and city:south san fran*
State california
name:amb propert* and city:south san fran*
[EMAIL PROTECTED]
amb propert city:south san fran
0 total matching documents
You're getting zero hits from "amb propert*" since the * is getting
stripped by the analyzer and there is no "amb propert" phrase match,
and with the AND (which should be all uppercase, right?) you're
definitely not getting hits.

**
Now flip the query around and it works:
Query city:south san fran* and name:amb propert*
State california
city:south san fran* and name:amb propert*
[EMAIL PROTECTED]
city:south san fran amb city:south san fran* and name:amb
propert*:propert*
56 total matching documents
You didn't quite flip it around - you took off some quotes...

Re: Proposition :adding minMergeDoc to IndexWriter

2003-09-23 Thread hui
It is great, Julien. Thanks.
Next time I am going to post the requests to the developer group.

Regards,
Hui
Re: Confusion over wildcard search logic

2003-09-23 Thread Terry Steichen
Erik's analysis is comprehensive and useful.  I think this example reflects
a common (and understandable) oversight - that wildcards do *not* work
within a phrase.  I've got caught on that many times myself.  Also there
may be confusion about the format field:(term1 term2), in that the
examples provided don't seem to make use of parentheses.  Finally, as I
recall, there were some bugs with certain wildcard patterns in 1.2.

Regards,

Terry


RE: Confusion over wildcard search logic

2003-09-23 Thread Dan Quaroni
Yeah, thanks a lot for your help!  I'm using the release version of Lucene
version 1.2.  

 not all your queries are truly WildcardQuery's though.  look at the 
 class it constructed to get a better idea of what is happening.

Yeah, I printed the queries out to see what was going on and noticed that.
I used wildcard in the subject as a sort of general description of what I
was after rather than its technical meaning to lucene, but I was curious
about why I was getting the query types I was getting.  

 your default field for QueryParser is property*?  Odd field name, or 
 is the output fishy?  I'm a bit confused by the property*: there.  
 I'm assuming you're outputting the Query.toString here.

That's just the Query.toString, yeah.  I didn't set any default fields to
property* so whatever it's spitting out comes from the QueryParser.

 Query name:amb propert* and city:south san fran*

 you're getting hits on the wildcard match at least, and probably on 
 name field amb as well.  again, phrase queries don't support 
 wildcards like you've done here with south san fran* so you're not 
 matching anything with that.

Ok... What's the correct procedure for doing a multi-word wildcard where I
want it to begin with "south san fran" but not get anything else that
contains "south" or "san"?  Just AND together south, san, and fran?

Although this might produce good results, my understanding was that booleans
retrieve all matches and store them in memory then resolve the booleans.  If
I use the term san to search California, I'm going to need a lot of memory
to store all of the temporary results...!  

Or is that only true when doing booleans on different fields?  If so, I
think we have our solution! :)

 I'm still confused by the output of propert*: here - are you using 
 the CVS version of Lucene?

The honest to goodness 1.2 release. :)

 I hope my above analysis helps.  I may not be perfectly right on 
 everything, but should be relatively close at identifying the issues.  
 Fixing it is more up to how you want to deal with it.  Perhaps a custom 
 QueryParser is more what you're after.

Well, I think the real key piece of information was to drop the quotes to
avoid the phrase queries.  Thanks!




RE: Confusion over wildcard search logic

2003-09-23 Thread Dan Quaroni
Your email prompted me to re-read the query parser documentation.  There are
only two examples using parentheses, which seem to be the answer to my
questions.  They are:

 (jakarta OR apache) AND website 

And

title:(+return +pink panther)


These leave a lot unanswered, though.  I mean, for example, what would
happen if the query were:

title:(+return +pink panther)
or 
title:(return*pink panther)

I.e., are the +'s or boolean operators required between each word inside
the parentheses?

I guess the answer is that I need to just play with it and find out, but as
others have mentioned, the documentation is lacking in some respects and I'd
say this is one of them... Maybe I'll submit some answers when I figure them
out. :)




Re: Confusion over wildcard search logic

2003-09-23 Thread Erik Hatcher
Better yet, submit some JUnit test cases that show how this stuff
works, if the ones in Lucene's codebase aren't comprehensive enough.
This is an excellent way to play with an API, get a good understanding
of it, and document it at the same time.

	Erik

On Tuesday, September 23, 2003, at 10:25  AM, Otis Gospodnetic wrote:

Hello,

I guess the answer is that I need to just play with it and find out,
but as
others have mentioned, the documentation is lacking in some respects
and I'd
say this is one of them... Maybe I'll submit some answers when I
figure them out. :)
Thank you, always appreciated.

Otis



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Multiple Index search using SearchBean.

2003-09-23 Thread Radhakrishnan, Velayudham
Hi folks, 

I have been using Lucene for a while. Our application needs to sort the
result set by last-modified date, so I was really happy to see SearchBean
and HitsIterator.

My question: can I use SearchBean to search multiple indices? I skimmed
through the source code but could not find any method to search multiple
indices or an array of Directories.

Can anyone help me? Thanks in advance.

~Vela




Re: Confusion over wildcard search logic

2003-09-23 Thread Erik Hatcher
On Tuesday, September 23, 2003, at 10:09  AM, Dan Quaroni wrote:
Yeah, thanks a lot for your help!  I'm using the release version of 
Lucene
version 1.2.
Perhaps give the latest codebase a try too, just to see if any fixes 
(particularly in that WildcardQuery.toString) are there.

you're getting hits on the wildcard match at least, and probably on
name field amb as well.  again, phrase queries don't support
wildcards like you've done here with south san fran* so you're not
matching anything with that.
Ok... What's the correct procedure for doing a multi-word wildcard 
where I
want it to begin with south san fran but not get anything else that
contains south or san?  Just and together the south, san, and fran?
As far as I know there isn't a way to do this with QueryParser
currently.  The real way to do this with the existing API is to use
PhrasePrefixQuery and do some manual setup before using it (like you'll
see in the current test case and Javadocs for it): enumerate all the
terms that start with "fran" and pass them to a PhrasePrefixQuery
(isn't this class misnamed?  What does this have to do with prefix?)
along with "south" and "san".
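What a phrase-with-prefix match buys you can be sketched outside Lucene. The helper below is only an illustration of the matching rule (adjacent, in order, last position matched by prefix), not Lucene's implementation:

```python
def phrase_prefix_match(tokens, words, prefix):
    """True if `words` followed by a token starting with `prefix`
    occur consecutively, in order, in `tokens`."""
    n = len(words)
    for i in range(len(tokens) - n):
        if tokens[i:i + n] == words and tokens[i + n].startswith(prefix):
            return True
    return False

city = "south san francisco".split()
assert phrase_prefix_match(city, ["south", "san"], "fran")      # in order, adjacent
assert not phrase_prefix_match(city, ["san", "south"], "fran")  # wrong order
assert not phrase_prefix_match("san francisco".split(),
                               ["south", "san"], "fran")        # phrase incomplete
```

In real Lucene you would enumerate the index terms matching the prefix and hand them to the query object; the adjacency and ordering constraints are the same.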

Although this might produce good results, my understanding was that 
booleans
retrieve all matches and store them in memory then resolve the 
booleans.  If
I use the term san to search California, I'm going to need a lot of 
memory
to store all of the temporary results...!
+south +san +fran* ought to do the trick.  I wouldn't worry about
memory too much until you've seen it to be a problem.  I think you'll
be fine (but I don't currently have the understanding or data to back
that up).

	Erik



RE: Confusion over wildcard search logic

2003-09-23 Thread Dan Quaroni
 Perhaps give the latest codebase a try too, just to see if any fixes 
 (particularly in that WildcardQuery.toString) are there.

It's our intention to put this into a production environment soon, so we
were waiting on 1.3 to go final before attempting to use it.

 i wouldn't worry about 
 memory too much until you've seen it to be a problem.  i think you'll 
 be fine (but don't currently have the understanding or data to back 
 that up).

The reason I split up the indexes by state was that I was running out of
memory (and searches were very slow) with the whole world of companies in
one index with all kinds of boolean joining.  With it split out, it seems to
do pretty well.

Well, after some extremely brief experimentation (Maybe I shoulda done it
before writing the email, huh?)  I discovered this:

**
This worked pretty well and got me some good results - the company that I
was looking for came back second (Which is pretty good given how general I
made the query)

Query name:(amb proper*)
State california
name:(amb proper*)
[EMAIL PROTECTED]
amb proper*
31988 total matching documents

*
This one matched a ton of documents, however the company I was looking for
came up first in the list, though with a pretty abysmal score of 0.23769014

Query name:(amb prop*) and city:(south san fran*)
State california
name:(amb prop*) and city:(south san fran*)
[EMAIL PROTECTED]
(amb prop*) (city:south city:san city:fran*)
721977 total matching documents


The previous query took 1552 millis.  I was able to reduce that to 285
millis just by adding the +'s you suggested:

Query name:(amb prop*) and city:(+south +san +fran*)
State california
name:(amb prop*) and city:(+south +san +fran*)
[EMAIL PROTECTED]
(amb prop*) (+city:south +city:san +city:fran*)
45011 total matching documents
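The drop from 721977 to 45011 matches follows from how clauses combine: optional clauses union their matching documents, while + (required) clauses intersect them. A toy sketch with invented postings:

```python
# term -> set of matching doc ids (made up for illustration)
postings = {
    "south": {1, 2, 3, 4},
    "san":   {2, 3, 4, 5, 6},
    "fran*": {3, 4, 7},
}

# city:(south san fran*) - all clauses optional: any one term matches.
union = set().union(*postings.values())

# city:(+south +san +fran*) - all clauses required: every term must match.
intersection = set.intersection(*postings.values())

print(len(union), len(intersection))  # the required form matches far fewer docs
```

Fewer candidate documents means less scoring work, which is consistent with the 1552 ms vs. 285 ms timings above.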


Incidentally, I say everything that I do with great awe at the power of
Lucene and respect for those who have made it possible.  Please don't take
anything I say as a gripe - I'm just learning how things work, and that's a
necessary step for any new software package of this type.  You just have to
learn the ins and outs and little quirks to be able to take full advantage
of it.

Thanks!





Is the lucene index serializable?

2003-09-23 Thread Albert Vila Puig
Can I send a small Lucene index by SOAP/TCP/HTTP/RMI? Is there a way to
serialize a Lucene index?
I want to send it from the indexer server to the search server, and then
do a merge operation in the search server with the previous index file.

Thanks.



Re: Is the lucene index serializable?

2003-09-23 Thread petite_abeille
Can I send a small Lucene index by SOAP/TCP/HTTP/RMI? Is there a way
to serialize a Lucene index?
I want to send it from the indexer server to the search server, and
then do a merge operation in the search server with the previous index
file.

Well, what about a very old-fashioned way instead? Something like
tar.gz + ftp? Not very glamorous, but workable...

Cheers,

PA.
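A minimal sketch of the tar.gz half of that route, using Python's standard library. The directory and file names are invented for the demo, and the FTP/HTTP transfer step is omitted:

```python
import pathlib
import shutil
import tempfile

# Build a throwaway "index" directory; the file name mimics a Lucene
# segments file but the contents are dummy bytes.
work = pathlib.Path(tempfile.mkdtemp())
index_dir = work / "index"
index_dir.mkdir()
(index_dir / "segments").write_bytes(b"\x00" * 4)

# Pack it into index.tar.gz - this single file is what you'd push over
# FTP/HTTP and unpack on the search server before merging.
archive = shutil.make_archive(str(index_dir), "gztar", root_dir=index_dir)
print(archive)
```

On the receiving side, `shutil.unpack_archive(archive, target_dir)` restores the directory so the search server can open it and merge it with the existing index.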



Design question

2003-09-23 Thread Jack Lauman
I, like a lot of other people, am new to Lucene.  Practical examples
are pretty scarce.

I have the following site:

http://www.tasteofwhatcom.com

It's built on JBoss 3.0.7/Tomcat 4.1.24, Apache 2.0.47/mod_jk 1.2.4,
MySQL 3.23.57 and RedHat 9.0.

I want to add search capabilites to the site to allow users to
search for entries.  All the menu items are in various MySQL
tables.  In addition some information is in static html and jsp
pages.  Some links are to PDF docs on the site.

How do you create a Lucene index on 'menu items' in the database as
well as the static pages both html and jsp, and any PDF docs that
are linked?  Are there any examples?

Thanks,

Jack





Re: Design question

2003-09-23 Thread petite_abeille
I, like a lot of other people are new to Lucene.  Practical examples
are pretty scarce.
If you don't mind learning by example, take a look at the "Powered by
Lucene" page. A fair number of those projects are open source.

http://jakarta.apache.org/lucene/docs/powered.html

PA.



Re: scoring algorithm

2003-09-23 Thread Ype Kingma
On Tuesday 23 September 2003 00:12, Chris Hennen wrote:
 Hi,

 what is the purpose of tf_q * idf_t / norm_q in Lucene's scoring
 algorithm:
 score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)

 I don't understand why the score has to be higher when the frequency of a
 term in the query is higher. What is normalized by norm_q?

It gives the user the possibility of assigning a higher weight to a term
in the query (by using a term weight, or by repeating the term).
The norm_q compensates the total score for the query weights, leaving the
scores of two different queries somewhat comparable.
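A toy walk-through of the formula with invented numbers shows both effects: the repeated term gets more relative weight, while norm_q keeps the overall scale in check. The idf variant below is the textbook 1 + log(N/(df+1)), not necessarily Lucene's exact formula:

```python
from math import log, sqrt

N = 1000  # documents in the index (invented)
# term -> (tf_q, df, tf_d); "lucene" is repeated in the query, so tf_q = 2
terms = {
    "lucene":  (2, 10, 3),
    "scoring": (1, 50, 1),
}

def idf(df):
    # Textbook idf; only illustrative, Lucene's damping differs.
    return 1.0 + log(N / (df + 1.0))

# norm_q depends only on the query's own term weights, so repeating a term
# shifts the relative weighting without blowing up the absolute score.
norm_q = sqrt(sum((tf_q * idf(df)) ** 2 for tf_q, df, _ in terms.values()))
norm_d = 10.0  # stand-in for the per-document length norm (norm_d_t)

# score_d = sum_t( tf_q * idf_t / norm_q  *  tf_d * idf_t / norm_d_t )
score = sum((tf_q * idf(df) / norm_q) * (tf_d * idf(df) / norm_d)
            for tf_q, df, tf_d in terms.values())
print(round(score, 3))
```

Doubling tf_q for one term raises that term's share of the sum, but since norm_q grows with the query weights too, scores from differently weighted queries stay on a comparable scale.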

Kind regards,
Ype

