Mozilla Desktop Search

2004-11-13 Thread Kevin A. Burton
  
http://www.peerfear.org/rss/permalink/2004/11/13/MozillaDesktopSearch/

The Mozilla Foundation may be considering a desktop search implementation:

Having launched the much-awaited Version 1.0 of the Firefox
browser yesterday (see story), The Mozilla Foundation is busy
planning enhancements to the open-source product, including the
possibility of integrating it with a variety of desktop search
tools. The Mozilla Foundation also wants to place Firefox in PCs
through reseller deals with PC hardware vendors and continue to
sharpen the product's pop-up ad-blocking technology. 

I'm not sure this is a good idea. Maybe it is, though. The technology 
just isn't there yet for cross-platform search.

I'd suggest using Lucene, compiled natively with GCJ into XPCOM 
components, but I'm not sure GCJ is up to the job here. If this 
approach is possible then I'd be very excited.

One advantage of this approach is that an HTTP server wouldn't be 
necessary, since you're already within the browser.

Good for everyone involved. No bloated Tomcat causing problems, and 
blazingly fast access within the browser. Also, since TCP isn't 
involved, you could fail gracefully when the search service isn't 
running; you could just start it.

--
Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an 
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc... then you 
should work for Rojo!  If you recommend someone and we hire them you'll 
get a free iPod!
   
Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412



RE: Anyone implemented custom hit ranking?

2004-11-13 Thread Chuck Williams
I've done some customization of scoring/ranking and plan to do more.  A
good place to start is with your own Similarity, extending Lucene's
DefaultSimilarity.  Like you, I found the default length normalization
did not work well with my dataset.  I separately weight each indexed
field according to a static relative importance (implemented as a query
boost factor that is automatically applied) and then disable length
normalization altogether by redefining lengthNorm() to always return
1.0f.

I also had problems with tf and idf normalization, especially with idf
dominating the ranking determination.  To address that, my Similarity
increases the base of the log for each, and adds a final square root to
the idf computation since Lucene squares the idf in the score
computations.
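A minimal standalone sketch of the two tweaks described above. The class and the base-10 log are my own assumptions (Chuck only says he "increases the base of the log"); in a real Lucene Similarity subclass these would be the bodies of lengthNorm() and idf():

```java
// Standalone sketch of the scoring tweaks described above. Not Lucene code:
// in a real Similarity subclass these formulas would live in
// lengthNorm(String field, int numTerms) and idf(int docFreq, int numDocs).
public class TweakedScoring {

    // Disable length normalization entirely: every field length scores alike.
    public static float lengthNorm(int numTerms) {
        return 1.0f;
    }

    // Raise the log base (base 10 here, an assumption) to flatten idf, then
    // take a square root because Lucene effectively squares idf in the score
    // computation (once in the query weight, once in the term score).
    public static float idf(int docFreq, int numDocs) {
        double flattened =
            Math.log((double) numDocs / (docFreq + 1)) / Math.log(10.0) + 1.0;
        return (float) Math.sqrt(flattened);
    }
}
```

Rare terms still score higher than common ones, but the gap between them is much smaller than with the default natural log, which is the point of the change.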

Have you tried the explain() mechanism?  It is a great way to see
precisely how your results are being scored (but be warned there is a
final normalization in Hits that explain() does not show -- this final
normalization does not affect the ranking order, but it does affect the
final scores).

Chuck

  > -Original Message-
  > From: Sanyi [mailto:[EMAIL PROTECTED]
  > Sent: Saturday, November 13, 2004 12:38 AM
  > To: [EMAIL PROTECTED]
  > Subject: Anyone implemented custom hit ranking?
  > 
  > Hi!
  > 
  > I have problems with short text ranking. I've read about the same
  > ranking problems in the list archives, but found only hints and
  > thoughts (adjust DefaultSimilarity, Similarity, etc...), not
  > complete solutions with source code.
  > Has anyone implemented a good solution for this problem? (example:
  > my search application returns about 10-20 pages of 1-2 word hits
  > for "hello", and then it starts to list the longer texts)
  > I've implemented a very simple solution: I boost documents shorter
  > than 300 chars with 1/300*doclength at index time. Now it works a
  > lot better. In fact, I can't see any problems now.
  > Anyway, I think this is not "the solution"; it's a patch or
  > workaround. So, I'd be interested in some kind of well-designed
  > complete solution for this problem.
  > 
  > Regards,
  > Sanyi


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to efficiently get # of search results, per attribute

2004-11-13 Thread Chris Lamprecht
Nader and Chuck,

Thanks for the responses, they're both helpful.  My index sizes will
begin on the order of 200,000 classes, and 20,000 instructors (and
much fewer departments), and grow over time to maybe a few million
classes.  Compared to some of the numbers I've seen on this mailing
list, my dataset is fairly small.  I think I'll not worry about
performance for now, until & unless it becomes an issue.

-Chris

On Sat, 13 Nov 2004 15:36:11 -0800, Chuck Williams <[EMAIL PROTECTED]> wrote:
> My Lucene application includes multi-faceted navigation that does a more
> complex version of the below.  I've got 5 different taxonomies into
> which every indexed item is classified.  The largest of the taxonomies
> has over 15,000 entries while the other 4 are much smaller. For every
> search query, I determine the best small set of nodes from each taxonomy
> to present to the user as drill down options, and provide the counts
> regarding how many results fall under each of these nodes.  At present I
> only have about 25,000 indexed objects and usually no more than 1,000
> results from the initial query.  To determine the drill-down options and
> counts, I scan up to 1,000 results computing the counts for all nodes
> into which these results classify.  Then for each taxonomy I pick the
> best drill-down options available (orthogonal set with reasonable
> branching factor that covers all results) and present them with their
> counts.  If there are more than 1,000 results, I extrapolate the
> computed counts to estimate the actual counts on the entire set of
> results.  This is all done with a single index and a single search.
> 
> The total time required for performing this computation for the one
> large taxonomy is under 10ms, running in full debug mode in my ide.  The
> query response time overall is subjectively instantaneous at the UI
> (Google-speed or better).  So, unless some dimension of the problem is
> much bigger than mine, I doubt performance will be an issue.
> 
> Chuck

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: How to efficiently get # of search results, per attribute

2004-11-13 Thread Chuck Williams
My Lucene application includes multi-faceted navigation that does a more
complex version of the below.  I've got 5 different taxonomies into
which every indexed item is classified.  The largest of the taxonomies
has over 15,000 entries while the other 4 are much smaller. For every
search query, I determine the best small set of nodes from each taxonomy
> to present to the user as drill-down options, and provide the counts
regarding how many results fall under each of these nodes.  At present I
only have about 25,000 indexed objects and usually no more than 1,000
results from the initial query.  To determine the drill-down options and
counts, I scan up to 1,000 results computing the counts for all nodes
into which these results classify.  Then for each taxonomy I pick the
best drill-down options available (orthogonal set with reasonable
branching factor that covers all results) and present them with their
counts.  If there are more than 1,000 results, I extrapolate the
computed counts to estimate the actual counts on the entire set of
results.  This is all done with a single index and a single search.
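The scan-and-extrapolate step described above can be sketched without any Lucene machinery. All names here are hypothetical; in the real application the node ids per result would come from a stored field of each hit:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the count-and-extrapolate step described above: scan up to
// 1,000 results, count how many fall under each taxonomy node, and scale
// the counts up when the result set is larger than the sample.
public class FacetCounts {
    static final int SAMPLE_LIMIT = 1000;

    // results: for each hit, the taxonomy node ids it classifies under.
    // Returns estimated counts per node over the whole result set.
    public static Map<Integer, Integer> countNodes(List<int[]> results) {
        Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
        int sample = Math.min(results.size(), SAMPLE_LIMIT);
        for (int i = 0; i < sample; i++) {
            for (int node : results.get(i)) {
                Integer c = counts.get(node);
                counts.put(node, c == null ? 1 : c + 1);
            }
        }
        // If we only scanned a sample, extrapolate to the full result set.
        if (results.size() > sample) {
            double scale = (double) results.size() / sample;
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                e.setValue((int) Math.round(e.getValue() * scale));
            }
        }
        return counts;
    }
}
```

The drill-down node selection (the "orthogonal set with reasonable branching factor") sits on top of these counts and is omitted here.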

The total time required for performing this computation for the one
large taxonomy is under 10ms, running in full debug mode in my IDE.  The
query response time overall is subjectively instantaneous at the UI
(Google-speed or better).  So, unless some dimension of the problem is
much bigger than mine, I doubt performance will be an issue.

Chuck

  > -Original Message-
  > From: Nader Henein [mailto:[EMAIL PROTECTED]
  > Sent: Saturday, November 13, 2004 2:29 AM
  > To: Lucene Users List
  > Subject: Re: How to efficiently get # of search results, per attribute
  > 
  > It depends on how many results they're looking through; here are two
  > scenarios I see:
  > 
  > 1] If you don't have that many records, you can fetch all the results
  > and then do a post-parsing step to determine totals.
  > 
  > 2] If you have a lot of entries in each category and you're worried
  > about fetching thousands of records every time, you can have separate
  > indices per category and search them in parallel (not Lucene parallel
  > search); you can get up to 100 hits from each one (efficiency) but
  > you'll also have the total from each search to display.
  > 
  > Either way you can boost speed using RAMDirectory if you need more
  > speed from the search, but whichever approach you choose I would
  > recommend that you sit down and do some number crunching to figure out
  > which way to go.
  > 
  > Hope this helps
  > 
  > Nader Henein


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-13 Thread Paul Elschot
On Saturday 13 November 2004 09:16, Sanyi wrote:
> > - leave the current implementation, raising an exception;
> > - handle the exception and limit the boolean query to the first 1024
> > (or whatever the limit is) terms;
> > - select, among the possible terms, only the first 1024 (or whatever
> > the limit is) most meaningful ones, leaving out all the others.
> 
> I like this idea, and I would refine it for myself like this:
> I'd also create a default rule to avoid handling exceptions for
> people who're happy with the default behavior:
> 
> Keep and search for only the longest 1024 fragments, so it'll throw
> away a, an, at, and, add, etc., but it'll automatically keep 1024
> variations like alpha, alfa, advanced, automatical, etc.

Wouldn't it be counterintuitive to only use the longest matches
for truncations?
To get only longer matches one can also use queries with
multiple ? characters, each matching exactly one character.

I think it would be better to encourage users to use longer,
and maybe also more, prefixes. This gives more precise results
and is more efficient to execute.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: about Stemming

2004-11-13 Thread Bernhard Messer
Miguel Angel wrote:
Hi, I have used the Lucene demos and I want to know how stemming
can be added to my applications.
 

Have a look at the lucene-sandbox. Under contributions there are 
stemmers for many different languages.
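For illustration only: a stemmer conflates word variants to a common root before indexing. The sandbox stemmers (e.g. the Porter/Snowball ones) do this properly per language; the toy rule below just strips a couple of English suffixes to show the effect, and is in no way the sandbox code:

```java
// Toy stemmer: strips a few common English suffixes so that variants of a
// word index to the same root. Real stemming (e.g. the Porter algorithm
// available via the sandbox) is considerably more careful than this.
public class ToyStemmer {
    public static String stem(String word) {
        if (word.endsWith("ing") && word.length() > 5) {
            return word.substring(0, word.length() - 3);
        }
        if (word.endsWith("s") && !word.endsWith("ss") && word.length() > 3) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }
}
```

In practice you would not write this yourself: you'd plug one of the sandbox stem filters into your Analyzer's token stream so both indexing and querying see stemmed terms.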


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


about Stemming

2004-11-13 Thread Miguel Angel
Hi, I have used the Lucene demos and I want to know how stemming
can be added to my applications.

-- 
Miguel Angel Angeles R.
Asesoria en Conectividad y Servidores
Telf. 97451277

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: lucene Scorers

2004-11-13 Thread Paul Elschot
On Friday 12 November 2004 22:56, Chuck Williams wrote:
> I had a similar need and wrote MaxDisjunctionQuery and
> MaxDisjunctionScorer.  Unfortunately these are not available as a patch
> but I've included the original message below that has the code (modulo
> line breaks added by simple text email format).
>
> This code is functional -- I use it in my app.  It is optimized for its
> stated use, which involves a small number of clauses.  You'd want to
> improve the incremental sorting (e.g., using the bucket technique of
> BooleanQuery) if you need it for large numbers of clauses.

If you're interested, you can also have a look here for
yet another DisjunctionScorer:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31785

It has the advantage that it implements skipTo(), so it can
be used as a subscorer of ConjunctionScorer, i.e. it can be
faster in situations like this:

aa AND (bb OR cc)

where bb and cc are treated by the DisjunctionScorer.
When aa is a filter this can also be used to implement
a filtering query.

 
> Re. Paul's suggested steps below, I did not integrate this with query
> parser as I didn't need that functionality (since I'm generating the
> multi-field expansions for which max is a much better scoring choice
> than sum).
> 
> Chuck
> 
> Included message:
> 
> -Original Message-
> From: Chuck Williams [mailto:[EMAIL PROTECTED] 
> Sent: Monday, October 11, 2004 9:55 PM
> To: [EMAIL PROTECTED]
> Subject: Contribution: better multi-field searching
> 
> The files included below (MaxDisjunctionQuery.java and
> MaxDisjunctionScorer.java) provide a new mechanism for searching across
> multiple fields.

The maximum indeed works well, also when the fields differ a lot in length.
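The scoring difference is easy to state without any Lucene machinery: sum-based disjunction rewards documents that match weakly in every field, while max takes the single best field. A toy illustration (pure Java, not the MaxDisjunctionScorer code itself):

```java
// Toy illustration of sum- vs max-based combination of per-field scores
// for one document. Not MaxDisjunctionScorer itself -- just the idea.
public class Combine {
    public static float sumScore(float[] fieldScores) {
        float s = 0f;
        for (float f : fieldScores) s += f;
        return s;
    }

    public static float maxScore(float[] fieldScores) {
        float m = 0f;
        for (float f : fieldScores) m = Math.max(m, f);
        return m;
    }
}
```

With fields (title, body), a document scoring (0.4, 0.4) outranks one scoring (0.7, 0.0) under sum but not under max, which is why max is often the better combiner when a single user term is expanded across multiple fields.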
 
Regards,
Paul


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How to efficiently get # of search results, per attribute

2004-11-13 Thread Nader Henein
It depends on how many results they're looking through; here are two 
scenarios I see:

1] If you don't have that many records, you can fetch all the results and 
then do a post-parsing step to determine totals.

2] If you have a lot of entries in each category and you're worried 
about fetching thousands of records every time, you can have 
separate indices per category and search them in parallel (not 
Lucene parallel search); you can get up to 100 hits from each one 
(efficiency) but you'll also have the total from each search to display.

Either way you can boost speed using RAMDirectory if you need more 
speed from the search, but whichever approach you choose I would 
recommend that you sit down and do some number crunching to figure out 
which way to go.

Hope this helps
Nader Henein

Chris Lamprecht wrote:
I'd like to implement a search across several types of "entities",
let's say, classes, professors, and departments.  I want the user to
be able to enter a simple, single query and not have to specify what
they're looking for.  Then I want the search results to be something
like this:
Search results for: "philosophy boyer"
Found: 121 classes - 5 professors - 2 departments

I know I could iterate through every hit returned and count them up
myself, but that seems inefficient if there are lots of results.  Is
there some other way to get this kind of information from the search
result set?  My other ideas are: doing a separate search each result
type, or storing different types in different indexes.  Any
suggestions?  Thanks for your help!
-Chris

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


How to efficiently get # of search results, per attribute

2004-11-13 Thread Chris Lamprecht
I'd like to implement a search across several types of "entities",
let's say, classes, professors, and departments.  I want the user to
be able to enter a simple, single query and not have to specify what
they're looking for.  Then I want the search results to be something
like this:

Search results for: "philosophy boyer"

Found: 121 classes - 5 professors - 2 departments




I know I could iterate through every hit returned and count them up
myself, but that seems inefficient if there are lots of results.  Is
there some other way to get this kind of information from the search
result set?  My other ideas are: doing a separate search each result
type, or storing different types in different indexes.  Any
suggestions?  Thanks for your help!

-Chris

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Anyone implemented custom hit ranking?

2004-11-13 Thread Sanyi
Hi!

I have problems with short text ranking. I've read about the same ranking
problems in the list archives, but found only hints and thoughts (adjust
DefaultSimilarity, Similarity, etc...), not complete solutions with
source code.
Has anyone implemented a good solution for this problem? (example: my
search application returns about 10-20 pages of 1-2 word hits for
"hello", and then it starts to list the longer texts)
I've implemented a very simple solution: I boost documents shorter than
300 chars with 1/300*doclength at index time. Now it works a lot better.
In fact, I can't see any problems now.
Anyway, I think this is not "the solution"; it's a patch or workaround.
So, I'd be interested in some kind of well-designed complete solution
for this problem.
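The boost rule described above, as a standalone function (in Lucene the value would be passed to Document.setBoost() before the document is added to the index):

```java
// Sanyi's index-time length boost as a standalone function: documents
// shorter than 300 chars are damped proportionally to their length, so a
// 150-char document gets boost 0.5 and anything 300 chars or longer is
// left alone. In Lucene this value would go to Document.setBoost().
public class ShortDocBoost {
    public static float boost(int docLength) {
        return docLength < 300 ? docLength / 300.0f : 1.0f;
    }
}
```

This damps the artificially high scores that very short documents get from length normalization, at the cost of baking the heuristic into the index rather than the query.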

Regards,
Sanyi



__ 
Do you Yahoo!? 
Check out the new Yahoo! Front Page. 
www.yahoo.com 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

2004-11-13 Thread Sanyi
> - leave the current implementation, raising an exception;
> - handle the exception and limit the boolean query to the first 1024
> (or whatever the limit is) terms;
> - select, among the possible terms, only the first 1024 (or whatever
> the limit is) most meaningful ones, leaving out all the others.

I like this idea, and I would refine it for myself like this:
I'd also create a default rule to avoid handling exceptions for people
who're happy with the default behavior:

Keep and search for only the longest 1024 fragments, so it'll throw
away a, an, at, and, add, etc., but it'll automatically keep 1024
variations like alpha, alfa, advanced, automatical, etc.
So, it'll automatically lower the search overhead and will still search
fine without throwing exceptions.
(for people who prefer the widest search range and don't care about the
huge overhead, we could leave a boolean switch for keeping not the
longest, but the shortest fragments)
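The proposed pruning rule is easy to sketch in isolation. This is purely illustrative of the idea above, not how Lucene's query rewriting actually works:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch of the pruning rule proposed above: when a wildcard/prefix query
// expands to more terms than the clause limit, keep only the longest ones
// on the theory that short fragments (a, an, at, ...) are the least useful.
public class LongestTerms {
    public static List<String> keepLongest(List<String> terms, int limit) {
        List<String> sorted = new ArrayList<String>(terms);
        // Longest terms first; Collections.sort is stable, so ties keep
        // their original relative order.
        Collections.sort(sorted, new Comparator<String>() {
            public int compare(String a, String b) {
                return b.length() - a.length();
            }
        });
        return sorted.subList(0, Math.min(limit, sorted.size()));
    }
}
```

The boolean switch mentioned above would simply flip the comparator to keep the shortest fragments instead.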





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]