Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-13 Thread Sanyi
 - leave the current implementation, raising an exception;
 - handle the exception and limit the boolean query to the first 1024
 (or whatever the limit is) terms;
 - select, among the possible terms, only the first 1024 (or whatever
 the limit is) most meaningful ones, leaving out all the others.

I like this idea, and I'd refine it like this:
I'd also create a default rule, to avoid handling exceptions for people 
who're happy with the default behavior:

Keep and search for only the longest 1024 fragments. So it'll throw away short 
expansions like a, an, at, and, add, etc., but it'll automatically keep 1024 
longer variations like alpha, alfa, advanced, automatical, etc.
So, it'll automatically lower the search overhead and will still search fine 
without throwing exceptions.
(For people who prefer the widest search range and don't care about the huge 
overhead, we could leave a boolean switch for keeping not the longest, but the 
shortest fragments.)



__ 
Do you Yahoo!? 
Check out the new Yahoo! Front Page. 
www.yahoo.com 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Anyone implemented custom hit ranking?

2004-11-13 Thread Sanyi
Hi!

I have problems with short-text ranking. I've read about the same ranking 
problems in the list archives, but found only hints and thoughts (adjust 
DefaultSimilarity, Similarity, etc.), not complete solutions with source code.
Has anyone implemented a good solution for this problem? (Example: my search 
application returns about 10-20 pages of 1-2 word hits for hello, and then it 
starts to list the longer texts.)
I've implemented a very simple solution: at index time, I boost documents 
shorter than 300 chars with 1/300*doclength. Now it works a lot better. In 
fact, I can't see any problems now.
Anyway, I think this is not the real solution; this is a patch or workaround.
So, I'd be interested in some kind of well-designed, complete solution for this 
problem.
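The 1/300*doclength boost described above amounts to this small function (a sketch of my reading of the description, with names of my own choosing; in Lucene of this era it would be applied via Document.setBoost() before addDocument()):

```java
public class ShortDocBoost {
    // Index-time boost: documents shorter than 300 characters are scaled
    // down linearly, so a 30-char page no longer outranks substantive
    // documents purely through length normalization.
    static float boostFor(int docLength) {
        if (docLength >= 300) return 1.0f;
        return docLength / 300.0f;
    }
    // At index time this would be applied with something like
    // doc.setBoost(boostFor(text.length())) before IndexWriter.addDocument(doc).
}
```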

Regards,
Sanyi






How to efficiently get # of search results, per attribute

2004-11-13 Thread Chris Lamprecht
I'd like to implement a search across several types of entities,
let's say, classes, professors, and departments.  I want the user to
be able to enter a simple, single query and not have to specify what
they're looking for.  Then I want the search results to be something
like this:

Search results for: philosophy boyer

Found: 121 classes - 5 professors - 2 departments

search results here...


I know I could iterate through every hit returned and count them up
myself, but that seems inefficient if there are lots of results.  Is
there some other way to get this kind of information from the search
result set?  My other ideas are: doing a separate search for each result
type, or storing different types in different indexes.  Any
suggestions?  Thanks for your help!
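The "iterate through every hit and count" baseline is a one-pass tally, assuming each indexed document carries a type label (field and method names here are illustrative, not part of Lucene's API):

```java
import java.util.HashMap;
import java.util.Map;

public class TypeCounts {
    // One pass over the matched documents' type labels, tallying hits
    // per entity type ("class", "professor", "department", ...).
    static Map<String, Integer> countByType(Iterable<String> hitTypes) {
        Map<String, Integer> counts = new HashMap<>();
        for (String type : hitTypes) {
            counts.merge(type, 1, Integer::sum);
        }
        return counts;
    }
}
```

The cost is proportional to the number of hits, which is what makes it feel inefficient for large result sets.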

-Chris




Re: How to efficiently get # of search results, per attribute

2004-11-13 Thread Nader Henein
It depends on how many results they're looking through; here are two 
scenarios I see:

1] If you don't have that many records, you can fetch all the results and 
then do a post-parsing step to determine totals.

2] If you have a lot of entries in each category and you're worried 
about fetching thousands of records every time, you can just have 
separate indices per category and search them in parallel (not 
Lucene's parallel search) and fetch up to 100 hits from each one 
(for efficiency), but you'll also have the total hit count from each search 
to display.

Either way you can boost up speed using RAMDirectory if you need more 
speed from the search, but whichever approach you choose I would 
recommend that you sit down and do some number crunching to figure out 
which way to go.

Hope this helps
Nader Henein

Chris Lamprecht wrote:
I'd like to implement a search across several types of entities,
let's say, classes, professors, and departments.  I want the user to
be able to enter a simple, single query and not have to specify what
they're looking for.  Then I want the search results to be something
like this:
Search results for: philosophy boyer
Found: 121 classes - 5 professors - 2 departments
search results here...
I know I could iterate through every hit returned and count them up
myself, but that seems inefficient if there are lots of results.  Is
there some other way to get this kind of information from the search
result set?  My other ideas are: doing a separate search each result
type, or storing different types in different indexes.  Any
suggestions?  Thanks for your help!
-Chris


Re: lucene Scorers

2004-11-13 Thread Paul Elschot
On Friday 12 November 2004 22:56, Chuck Williams wrote:
 I had a similar need and wrote MaxDisjunctionQuery and
 MaxDisjunctionScorer.  Unfortunately these are not available as a patch
 but I've included the original message below that has the code (modulo
 line breaks added by simple text email format).

 This code is functional -- I use it in my app.  It is optimized for its
 stated use, which involves a small number of clauses.  You'd want to
 improve the incremental sorting (e.g., using the bucket technique of
 BooleanQuery) if you need it for large numbers of clauses.

If you're interested, you can also have a look here for
yet another DisjunctionScorer:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31785

It has the advantage that it implements skipTo(), so it can
be used as a subscorer of ConjunctionScorer, i.e. it can be
faster in situations like this:

aa AND (bb OR cc)

where bb and cc are handled by the DisjunctionScorer.
When aa is a filter, this can also be used to implement
a filtering query.
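The skipTo() contract that makes this fast can be modeled with a toy doc-id stream (this is only an illustration of the idea, not Lucene's actual Scorer class):

```java
public class SimpleDocIdIterator {
    // Minimal model of a scorer's doc-id stream: a sorted array of matching
    // doc ids plus skipTo(target), which advances to the first doc >= target.
    // This is what lets a disjunction serve as a subscorer of a conjunction:
    // the AND side can leapfrog past non-candidates instead of stepping
    // through every matching document one by one.
    private final int[] docs;
    private int pos = -1;

    SimpleDocIdIterator(int[] sortedDocs) { this.docs = sortedDocs; }

    // Advance to the first doc id >= target; returns false when exhausted.
    boolean skipTo(int target) {
        pos = Math.max(pos, 0);
        while (pos < docs.length && docs[pos] < target) pos++;
        return pos < docs.length;
    }

    int doc() { return docs[pos]; }
}
```

A ConjunctionScorer over "aa" and the disjunction repeatedly calls skipTo() on whichever side is behind, until both land on the same doc.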

 
 Re. Paul's suggested steps below, I did not integrate this with query
 parser as I didn't need that functionality (since I'm generating the
 multi-field expansions for which max is a much better scoring choice
 than sum).
 
 Chuck
 
 Included message:
 
 -Original Message-
 From: Chuck Williams [mailto:[EMAIL PROTECTED] 
 Sent: Monday, October 11, 2004 9:55 PM
 To: [EMAIL PROTECTED]
 Subject: Contribution: better multi-field searching
 
 The files included below (MaxDisjunctionQuery.java and
 MaxDisjunctionScorer.java) provide a new mechanism for searching across
 multiple fields.

The maximum indeed works well, also when the fields differ a lot in length.
 
Regards,
Paul





about Stemming

2004-11-13 Thread Miguel Angel
Hi, I have used the Lucene demos and I want to know how
stemming can be added to my applications.

-- 
Miguel Angel Angeles R.
Asesoria en Conectividad y Servidores
Telf. 97451277




Re: about Stemming

2004-11-13 Thread Bernhard Messer
Miguel Angel schrieb:
Hi, I have used the Lucene demos and I want to know how
stemming can be added to my applications.
 

have a look at the lucene-sandbox. Under contributions there are 
stemmers for many different languages.
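To give a flavor of what those stemmers do, here is a toy fragment implementing just the plural-stripping step of the Porter algorithm (step 1a); the real sandbox stemmers implement the full algorithms, and in Lucene they are wired into an Analyzer as a TokenFilter (e.g. PorterStemFilter for English):

```java
public class PluralStemmer {
    // Toy illustration of stemming: Porter's step 1a (plural endings) only.
    // "caresses" -> "caress", "ponies" -> "poni", "cats" -> "cat".
    static String stem(String word) {
        if (word.endsWith("sses")) return word.substring(0, word.length() - 2);
        if (word.endsWith("ies"))  return word.substring(0, word.length() - 2);
        if (word.endsWith("ss"))   return word;  // "caress" stays as is
        if (word.endsWith("s"))    return word.substring(0, word.length() - 1);
        return word;
    }
}
```

The key point is that the same filter must run at both index time and query time, so "ponies" in a document and "pony" in a query reduce to matching terms.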




Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-13 Thread Paul Elschot
On Saturday 13 November 2004 09:16, Sanyi wrote:
  - leave the current implementation, raising an exception;
  - handle the exception and limit the boolean query to the first 1024
  (or what ever the limit is) terms;
  - select, between the possible terms, only the first 1024 (or what
  ever the limit is) more meaningful ones, leaving out all the others.
 
 I like this idea and I would finalize to myself like this:
 I'd also create a default rule for that to avoid handling exceptions for 
people who're happy with
 the default behavior:
 
 Keep and search for only the longest 1024 fragments, so it'll throw 
a,an,at,and,add,etc.., but
 it'll automatically keep 1024 variations like 
alpha,alfa,advanced,automatical,etc..

Wouldn't it be counterintuitive to only use the longest matches
for truncations?
To have only longer matches one can also use queries with
multiple ? characters, each matching exactly one character.

I think it would be better to encourage users to use longer,
and maybe also more, prefixes. This gives more precise results
and is more efficient to execute.

Regards,
Paul Elschot





RE: How to efficiently get # of search results, per attribute

2004-11-13 Thread Chuck Williams
My Lucene application includes multi-faceted navigation that does a more
complex version of the below.  I've got 5 different taxonomies into
which every indexed item is classified.  The largest of the taxonomies
has over 15,000 entries while the other 4 are much smaller. For every
search query, I determine the best small set of nodes from each taxonomy
to present to the user as drill down options, and provide the counts
regarding how many results fall under each of these nodes.  At present I
only have about 25,000 indexed objects and usually no more than 1,000
results from the initial query.  To determine the drill-down options and
counts, I scan up to 1,000 results computing the counts for all nodes
into which these results classify.  Then for each taxonomy I pick the
best drill-down options available (orthogonal set with reasonable
branching factor that covers all results) and present them with their
counts.  If there are more than 1,000 results, I extrapolate the
computed counts to estimate the actual counts on the entire set of
results.  This is all done with a single index and a single search.

The total time required for performing this computation for the one
large taxonomy is under 10ms, running in full debug mode in my IDE.  The
query response time overall is subjectively instantaneous at the UI
(Google-speed or better).  So, unless some dimension of the problem is
much bigger than mine, I doubt performance will be an issue.
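The extrapolation step described above reduces to simple arithmetic: count over the scanned sample, then scale up to the full result set (a sketch of my reading of the approach, with names of my own choosing):

```java
public class FacetEstimate {
    // Counts are computed over the first 'scanned' results and scaled up
    // to estimate counts over all 'totalHits' results. If everything was
    // scanned, the count is exact.
    static int estimate(int countInSample, int scanned, int totalHits) {
        if (totalHits <= scanned) return countInSample;
        return Math.round(countInSample * (float) totalHits / scanned);
    }
}
```

So a node seen in 100 of the first 1,000 results, with 5,000 total hits, would be presented with an estimated count of 500.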

Chuck

   -Original Message-
   From: Nader Henein [mailto:[EMAIL PROTECTED]
   Sent: Saturday, November 13, 2004 2:29 AM
   To: Lucene Users List
   Subject: Re: How to efficiently get # of search results, per
attribute
   
   It depends on how many results they're looking through; here are two
   scenarios I see:
   
   1] If you don't have that many records, you can fetch all the results
   and then do a post-parsing step to determine totals.
   
   2] If you have a lot of entries in each category and you're worried
   about fetching thousands of records every time, you can just have
   separate indices per category and search them in parallel (not
   Lucene's parallel search) and fetch up to 100 hits from each one
   (for efficiency), but you'll also have the total from the search to
   display.
   
   Either way you can boost up speed using RAMDirectory if you need
   more speed from the search, but whichever approach you choose I would
   recommend that you sit down and do some number crunching to figure
   out which way to go.
   
   
   Hope this helps
   
   Nader Henein
   
   
   
   Chris Lamprecht wrote:
   
   I'd like to implement a search across several types of entities,
   let's say, classes, professors, and departments.  I want the user
to
   be able to enter a simple, single query and not have to specify
what
   they're looking for.  Then I want the search results to be
something
   like this:
   
   Search results for: philosophy boyer
   
   Found: 121 classes - 5 professors - 2 departments
   
   search results here...
   
   
   I know I could iterate through every hit returned and count them up
   myself, but that seems inefficient if there are lots of results.
Is
   there some other way to get this kind of information from the
search
   result set?  My other ideas are: doing a separate search each
result
   type, or storing different types in different indexes.  Any
   suggestions?  Thanks for your help!
   
   -Chris
   
  



Re: How to efficiently get # of search results, per attribute

2004-11-13 Thread Chris Lamprecht
Nader and Chuck,

Thanks for the responses, they're both helpful.  My index sizes will
begin on the order of 200,000 classes, and 20,000 instructors (and
much fewer departments), and grow over time to maybe a few million
classes.  Compared to some of the numbers I've seen on this mailing
list, my dataset is fairly small.  I think I'll not worry about
performance for now, until  unless it becomes an issue.

-Chris

On Sat, 13 Nov 2004 15:36:11 -0800, Chuck Williams [EMAIL PROTECTED] wrote:
 My Lucene application includes multi-faceted navigation that does a more
 complex version of the below.  I've got 5 different taxonomies into
 which every indexed item is classified.  The largest of the taxonomies
 has over 15,000 entries while the other 4 are much smaller. For every
 search query, I determine the best small set of nodes from each taxonomy
 to present to the user as drill down options, and provide the counts
 regarding how many results fall under each of these nodes.  At present I
 only have about 25,000 indexed objects and usually no more than 1,000
 results from the initial query.  To determine the drill-down options and
 counts, I scan up to 1,000 results computing the counts for all nodes
 into which these results classify.  Then for each taxonomy I pick the
 best drill-down options available (orthogonal set with reasonable
 branching factor that covers all results) and present them with their
 counts.  If there are more than 1,000 results, I extrapolate the
 computed counts to estimate the actual counts on the entire set of
 results.  This is all done with a single index and a single search.
 
 The total time required for performing this computation for the one
 large taxonomy is under 10ms, running in full debug mode in my ide.  The
 query response time overall is subjectively instantaneous at the UI
 (Google-speed or better).  So, unless some dimension of the problem is
 much bigger than mine, I doubt performance will be an issue.
 
 Chuck
 
 
 



RE: Anyone implemented custom hit ranking?

2004-11-13 Thread Chuck Williams
I've done some customization of scoring/ranking and plan to do more.  A
good place to start is with your own Similarity, extending Lucene's
DefaultSimilarity.  Like you, I found the default length normalization
to not work well with my dataset.  I separately weight each indexed
field according to a static relative importance (implemented as a query
boost factor that is automatically applied) and then disable length
normalization altogether by redefining lengthNorm() to always return
1.0f.

I also had problems with tf and idf normalization, especially with idf
dominating the ranking determination.  To address that, my Similarity
increases the base of the log for each, and adds a final square root to
the idf computation since Lucene squares the idf in the score
computations.
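The idf adjustments described above can be sketched as plain math (the default formula here approximates DefaultSimilarity's idf of that era, 1 + ln(N/(df+1)); the "tamed" variant, with its configurable log base and final square root, is my reconstruction of the description, not code from Chuck's application):

```java
public class TamedIdf {
    // Approximation of Lucene's default idf: 1 + ln(N / (df + 1)).
    static double defaultIdf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Raising the log base flattens the idf curve, and the square root
    // compensates for idf effectively entering the score squared,
    // so rare terms no longer dominate the ranking.
    static double tamedIdf(int numDocs, int docFreq, double base) {
        double logBase = Math.log((double) numDocs / (docFreq + 1)) / Math.log(base);
        return Math.sqrt(1.0 + logBase);
    }
}
```

Rare terms still score higher than common ones, just by a much smaller margin; the lengthNorm() change is simpler still, a one-line override returning 1.0f.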

Have you tried the explain() mechanism?  It is a great way to see
precisely how your results are being scored (but be warned there is a
final normalization in Hits that explain() does not show -- this final
normalization does not affect the ranking order, but it does affect the
final scores).

Chuck

   -Original Message-
   From: Sanyi [mailto:[EMAIL PROTECTED]
   Sent: Saturday, November 13, 2004 12:38 AM
   To: [EMAIL PROTECTED]
   Subject: Anyone implemented custom hit ranking?
   
   Hi!
   
   I have problems with short text ranking. I've read about same raking
   problems in the list
   archives, but found only hints and toughts (adjust
DefaultSimilarity,
   Similarity, etc...), not
   complete solutions with source code.
   Anyone implemented a good solution for this problem? (example: my
search
   application returns about
   10-20 pages of 1-2 word hits for hello, and then it starts to list
the
   longer texts)
   I've implemented a very simple solution: I boost documents shorter
than
   300 chars with
   1/300*doclength at index time. Now it works a lot better. In fact, I
   can't see any problems now.
   Anyway, I think this is not the solution, this is a patch or
   workaround.
   So, I'd be interested in some kind of well designed complete
solution
   for this problem.
   
   Regards,
   Sanyi
   
   
   





Mozilla Desktop Search

2004-11-13 Thread Kevin A. Burton
  
http://www.peerfear.org/rss/permalink/2004/11/13/MozillaDesktopSearch/

The Mozilla foundation may be considering a desktop search 
implementation 
http://computerworld.com/developmenttopics/websitemgmt/story/0,10801,97396,00.html?f=x10 
:

Having launched the much-awaited Version 1.0 of the Firefox
browser yesterday (see story), The Mozilla Foundation is busy
planning enhancements to the open-source product, including the
possibility of integrating it with a variety of desktop search
tools. The Mozilla Foundation also wants to place Firefox in PCs
through reseller deals with PC hardware vendors and continue to
sharpen the product's pop-up ad-blocking technology. 

I'm not sure this is a good idea. Maybe it is, though. The technology 
just isn't there for cross-platform search.

I'd suggest using Lucene, compiled natively with GCJ into XPCOM 
components, but I'm not sure GCJ is up to the job here. 
If this approach is possible then I'd be very excited.

One advantage of this approach is that an HTTP server wouldn't be 
necessary, since you're already within the browser.

Good for everyone involved. No bloated Tomcat causing problems, and 
blazingly fast access within the browser. Also, since TCP isn't 
involved, you could fail gracefully when the search service isn't 
running; you could just start it.

--
Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an 
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc... then you 
should work for Rojo!  If you recommend someone and we hire them you'll 
get a free iPod!
   
Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412