Re: Term Weights and Clustering

2005-02-23 Thread David Spencer
I'm a little confused on exactly what you want, but if your goal 
is to cluster your papers w/ carrot2 then I found these links helpful:

http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html
http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip
The only caveat is that I found carrot2 tends not to scale beyond 200 or so 
docs, though this probably depends on the length of the docs & the # of 
different tokens.

I was able to use the above to integrate w/ a lucene search results page in 
just an hour or so.
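
For the TF*IDF part of the question below: as far as I know there's no API 
call that hands back the weights directly, but here's a minimal sketch of 
computing them yourself from stored term vectors (assumes the field was 
indexed with term vectors turned on, Lucene 1.4-era APIs, and 
DefaultSimilarity-style tf/idf formulas, which are just illustrative here):

	IndexReader reader = IndexReader.open( "/path/to/index");
	int numDocs = reader.numDocs();
	for ( int doc = 0; doc < reader.maxDoc(); doc++)
	{
		if ( reader.isDeleted( doc)) continue;
		TermFreqVector tfv = reader.getTermFreqVector( doc, "contents");
		if ( tfv == null) continue; // no term vector stored for this doc
		String[] terms = tfv.getTerms();
		int[] freqs = tfv.getTermFrequencies();
		for ( int i = 0; i < terms.length; i++)
		{
			int df = reader.docFreq( new Term( "contents", terms[ i]));
			double tf = Math.sqrt( freqs[ i]);                         // term freq in this doc
			double idf = Math.log( (double) numDocs / (df + 1)) + 1.0; // inverse doc freq
			double weight = tf * idf; // one cell of the TDM
			// ...store weight in your term-document matrix...
		}
	}
	reader.close();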

Owen Densmore wrote:
I'm building a TDM (Term Document Matrix) from my lucene index.  As part 
of this, it would be useful to have the document term weights (the 
TF*IDF-weight) if they are already available.  Naturally I can compute 
them, but I suspect they are lurking behind an API I've not discovered 
yet.  Is there an API for getting them?

I'm doing this as a first step in discovering a good set of clustering 
labels.  My data collection is 1200 research papers, all of which have 
good meta data: titles, authors, abstracts, keyphrases and so on.

One source for how to do this is the thesis of Stanislaw Osinski and 
others like it:
http://www.dcs.shef.ac.uk/teaching/eproj/msc2004/abs/m3so.htm
And the Carrot2 project which uses similar techniques.
http://www.cs.put.poznan.pl/dweiss/carrot/

My problem is simple: I need a fairly clear discussion on exactly how to 
generate the labels, and to assign documents to them.  The thesis is 
quite good, but I'm not sure I can reduce it to practice in the 2-3 days 
I have to evaluate it!  Lucene has made the TDM easy to calculate, but I 
basically don't know what to do next!

Can anyone comment on whether or not this will work, and if so, suggest 
a quick way to get a demo on the air?  For example, I don't seem to be 
able to ask Carrot2 to do a Google "site" search.  If I could, I could 
simply aim Carrot2 at my collection with a very general search and see 
what clusters it discovers.  This may be a gross misuse of Carrot2's 
clustering anyway, so could easily be a blind alley.

Or is there a different stunt with Lucene that might work?  For example, 
use Lucene to cluster the docs using a batch search where the queries 
are Library of Congress descriptions!  Batch searching is *really fast* 
in Lucene -- I've been able to search the data collection against each 
distinct keyphrase in seconds!

Owen


Re: knowing which field contributed the search result

2005-02-22 Thread David Spencer
John Wang wrote:
Hi David:
Can you further explain which calls specically would solve my problem?
Not in depth but anyway:
Examine the output of Explanation.toHtml() and/or 
Explanation.toString(). Does it contain the info you want? If so, call 
the other Explanation methods and/or dig into the src if necessary. 
getValue() is the score, so all that's missing is the name of the field, 
and I'm not sure if that's directly returned or not.
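
A hedged sketch of what I mean (Lucene 1.4-era API; contents1/contents2 are 
just the example field names from the original question):

	Explanation exp = searcher.explain( query, hits.id( 0));
	// the nested clauses of the explanation show which field's terms
	// actually contributed to the score for this hit
	System.out.println( exp.toString());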


Thanks
-John
On Mon, 21 Feb 2005 12:20:15 -0800, David Spencer
<[EMAIL PROTECTED]> wrote:
John Wang wrote:

Anyone has any thoughts on this?
Does this help?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Searchable.html#explain(org.apache.lucene.search.Query,%20int)
Thanks
-John
On Wed, 16 Feb 2005 14:39:52 -0800, John Wang <[EMAIL PROTECTED]> wrote:

Hi:
 Is there a way, given a hit from a search, to find out which
fields contributed to the hit?
e.g.
If my search for:
contents1="brown fox" OR contents2="black bear"
can the document found by this query also have information on
whether it was found via contents1 or contents2 or both?
Thanks
-John





Re: knowing which field contributed the search result

2005-02-21 Thread David Spencer
John Wang wrote:
Anyone has any thoughts on this?
Does this help?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Searchable.html#explain(org.apache.lucene.search.Query,%20int)
Thanks
-John
On Wed, 16 Feb 2005 14:39:52 -0800, John Wang <[EMAIL PROTECTED]> wrote:
Hi:
  Is there a way, given a hit from a search, to find out which
fields contributed to the hit?
e.g.
If my search for:
contents1="brown fox" OR contents2="black bear"
can the document found by this query also have information on
whether it was found via contents1 or contents2 or both?
Thanks
-John



Re: Handling Synonyms

2005-02-21 Thread David Spencer
Luke Shannon wrote:
Hello;
Does anyone see a problem with the following approach?
No, no problem with it and it's in fact what my "Wordnet Query 
Expansion" sandbox module does.

The nice thing about Lucene is you at least have the option of doing 
things the other way - you can write a custom Analyzer that puts all 
synonyms at the same token offset so they appear to be in the same place 
in the token stream. Thinking about it...this approach, with the 
Analyzer, lets users search for phrases which would match a synonym, so, 
using your example below, the text "bright red engine" would be matched 
by either phrase "bright red" or "bright colour". Doing the query 
expansion is trickier if you allow phrases.

For synonyms, rather than putting them in the index, I put the original term
and all the synonyms in the query.
Every time I create a query, I check if the term has any synonyms. If it
does, I create a BooleanQuery, OR'ing one Query object for each synonym.
So if I have a synoym list:
red = colour, primary, stop
And someone wants to search the desc field for red, I would end up with
something like:
( (desc:*red*) (desc:*colour*) (desc:*stop*) ).
I don't like that bit about substring terms, but if it's right for you 
ok - if you insist on loosening things I'd consider fuzzy terms 
(desc:red~ ...etc).
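
Here's a minimal sketch of that query-time expansion (Lucene 1.4-era 
BooleanQuery.add( query, required, prohibited); where the synonym list comes 
from is up to you - I'm just passing it in):

	public static Query expandWithSynonyms( String field, String term, List synonyms)
	{
		BooleanQuery bq = new BooleanQuery();
		bq.add( new TermQuery( new Term( field, term)), false, false);
		for ( Iterator it = synonyms.iterator(); it.hasNext();)
		{
			TermQuery tq = new TermQuery( new Term( field, (String) it.next()));
			// tq.setBoost( 0.9f); // optionally penalize synonyms a bit
			bq.add( tq, false, false); // OR'd in: not required, not prohibited
		}
		return bq;
	}

So for the "red" example above this would give roughly
( desc:red desc:colour desc:primary desc:stop ).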


Now the synonyms wouldn't be in the index; the Query would account for all
the possible synonym terms.
Luke



Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread David Spencer
markharw00d wrote:
 >>But this brings up - has anyone run Lucene off a database trigger or 
are  triggers known to be slow and bad for this use?

I suspect the tricky bit would be knowing when to balance the calls to 
Reader/Writer closes, opens and optimizes.
Record updates are the usual fun and games involving a reader.delete and 
a document.write.
I agree this is the usual tricky/"fun" thing.
In similar situations I have:
- batched the updates in, well, sort of a "queue"
- flushed the "queue" after "t" seconds or "n" documents (e.g. t=60sec, 
n=1000 docs)

Part of the trick is a document that changes multiple times during one 
of these periods - if you have an "add queue" and a "delete queue" then 
you'll probably end up with the wrong index, with the doc appearing either 
zero times or more than once - not impossible to cover, just something to keep in mind.
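
A rough sketch of the batching idea (hypothetical field/class names - the 
point is to dedupe by key so a doc that changed twice in one window is only 
written once, and to delete before adding so it never appears in the index 
twice; Lucene 1.4-era API):

	Map pending = new HashMap(); // key -> latest Document for that key

	synchronized void enqueue( String key, Document doc) { pending.put( key, doc); }

	synchronized void flush() throws IOException
	{
		IndexReader reader = IndexReader.open( indexDir);
		for ( Iterator it = pending.keySet().iterator(); it.hasNext();)
			reader.delete( new Term( "key", (String) it.next())); // drop the old copy, if any
		reader.close();
		IndexWriter writer = new IndexWriter( indexDir, analyzer, false);
		for ( Iterator it = pending.values().iterator(); it.hasNext();)
			writer.addDocument( (Document) it.next());
		writer.close();
		pending.clear(); // call flush() every "t" seconds or "n" docs
	}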

- Dave



Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread David Spencer
Otis Gospodnetic wrote:
The most obvious answer is that the full-text indexing features of
RDBMS's are not as good (as fast) as Lucene.  MySQL, PostgreSQL,
Oracle, MS SQL Server etc. all have full-text indexing/searching
features, 
but I always hear people complaining about the speed. 
Yeah, but in theory, in the ideal world :), it shouldn't be any slower - 
there's no magic Lucene has that DB's don't.  And the big advantage of 
it being embedded in the DB is the index can always be up to date, just 
as if you had Lucene updating the index based on a trigger. You don't 
need any separate cron job to periodically update the index.

But this brings up - has anyone run Lucene off a database trigger or are 
 triggers known to be slow and bad for this use?

A
person from a well-known online bookseller told me recently that Lucene
was about 10x faster that MySQL for full-text searching, and I am
currently helping someone get away from MySQL and into Lucene for
performance reasons.
Otis

--- "Steven J. Owens" <[EMAIL PROTECTED]> wrote:

Hi,
I was rambling to some friends about an idea to build a
cache-aware JDBC driver wrapper, to make it easier to keep a lucene
index of a database up to date.
They asked me a question that I have to take seriously, which is
that most RDBMSes provide some built-in fulltext searching -
postgres,
mysql, even oracle - why not use that instead of adding another layer
of caching?
I have to take this question seriously, especially since it
reminds me a lot of what Doug has often said to folks contemplating
doing similar things (caching query results, etc) with Lucene.
Has anybody done some serious investigation into this, and could
summarize the pros and cons?
--
Steven J. Owens
[EMAIL PROTECTED]
"I'm going to make broad, sweeping generalizations and strong,
declarative statements, because otherwise I'll be here all night and
this document will be four times longer and much less fun to read.
Take it all with a grain of salt." - http://darksleep.com/notablog


Re: Document comparison

2005-02-18 Thread David Spencer
Otis Gospodnetic wrote:
Matt,
Erik and I have some code for this in Lucene in Action, but David
Spencer did this since the book was published:
  http://www.lucenebook.com/blog/announcements/more_like_this.html

If you want an informal way of doing it you're right, just feed the 
words of the source doc to a query. The doc for the code is at this 
easy-to-remember URL:
http://searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/SimilarityQueries.html#formSimilarQuery(java.lang.String,%20org.apache.lucene.analysis.Analyzer,%20java.lang.String,%20java.util.Set)

Follow Otis's link above to my weblog for the code.
The "MoreLikeThis" stuff is similar but more sophisticated.
Also if you want the IR way I think you'd do a "cosine measure". I know 
carrot2 has the code - this might be it:

http://www.searchmorph.com/pub/carrot2/jd/com/chilang/carrot/filter/cluster/rough/measure/CosineCoefficient.html
Otis
--- Matt Chaput <[EMAIL PROTECTED]> wrote:

Is there a simple, efficient way to compute similarity of documents 
indexed with Lucene?

My first, naive idea is to use the entire contents of one document as
a 
query to the second document, and use the score as a similarity 
measurement. But I think I'm probably way off base with that.

Can any IR pros set me straight? Thanks very much.
Matt
--
Matt Chaput
Word Monkey
Side Effects Software Inc.
"A goddamned ray of sunshine all the goddamned time"
-- Sparkle Hayter


Re: Search Performance

2005-02-18 Thread David Spencer
Michael Celona wrote:
Just tried that... works like a charm... thanks...
Could you clarify what the problem was - just the overhead of opening 
IndexSearchers?
Michael
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 18, 2005 4:42 PM
To: Lucene Users List; Chris Lamprecht
Subject: Re: Search Performance

Or you could just open a new IndexSearcher, forget the old one, and
have GC collect it when everyone is done with it.
Otis
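
A minimal sketch of that swap-instead-of-close idiom (Lucene 1.4-era API; 
indexPath is assumed to be the index directory):

	private IndexSearcher searcher;
	private long version = -1;

	public synchronized IndexSearcher getSearcher() throws IOException
	{
		long current = IndexReader.getCurrentVersion( indexPath);
		if ( searcher == null || current != version)
		{
			searcher = new IndexSearcher( indexPath); // old one is just dropped for GC
			version = current;
		}
		return searcher;
	}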
--- Chris Lamprecht <[EMAIL PROTECTED]> wrote:

I should have mentioned, the reason for not doing this the obvious,
simple way (just close the Searcher and reopen it if a new version is
available) is because some threads could be in the middle of
iterating
through the search Hits.  If you close the Searcher they get a Bad
file descriptor IOException.  As I found out the hard way :)
On Fri, 18 Feb 2005 15:03:29 -0600, Chris Lamprecht
<[EMAIL PROTECTED]> wrote:
I recently dealt with the issue of re-using a Searcher with an
index
that changes often.  I wrote a class that allows my searching
classes
to "check out" a lucene Searcher, perform a search, and then return
the Searcher.  It's similar to a database connection pool, except
that


Re: Search Performance

2005-02-18 Thread David Spencer
Are you using the highlighter or doing anything non-trivial in 
displaying the results?

Are the pages being compressed (mod_gzip or some servlet equivalent)? 
This definitely helps, though to see the effect you may have to make 
sure your simulated users are "remote".

Also consider caching search results if it's reasonable to assume users 
may search for the same things.

I made some measurements on caching on my site:
http://www.searchmorph.com/weblog/index.php?id=41
http://www.searchmorph.com/weblog/index.php?id=40
And I use OSCache:
http://www.searchmorph.com/weblog/index.php?id=38
http://www.opensymphony.com/oscache/


Michael Celona wrote:
What is single-handedly the best way to improve search performance?  I have
an index in the 2G range stored on the local file system of the searcher.
Under a load test of 5 simultaneous users my average search time is ~4700
ms.  Under a load test of 10 simultaneous users my average search time is
~1 ms.  I have given the JVM 2G of memory and am using dual 3GHz
Xeons.  Any ideas?  

 

Michael




Re: Search Performance

2005-02-18 Thread David Spencer
Noone has mentioned JVM options yet.
[a] -server
[b] -XX:CompileThreshold=1000
[c] Raise the -Xms value if you haven't done so (-Xms...)
I think by default the VM runs with "-client" but -server makes more 
sense for web containers (Tomcat etc).
[b] tells the hotspot compiler to compile methods sooner - you can lower 
the 1000 to, say, '2', which makes it compile methods after they've executed 2 
times - I had trouble once lowering this to 1 for some reason


Also, even though you're not supposed to need to do this, I've found it 
helpful to force gc() periodically e.g. every minute via this idiom:

public static long gc()
{
	long bef = mem();
	System.gc();
	sleep( 100);
	System.runFinalization();
	sleep( 100);
	System.gc();
	long aft = mem();
	return aft - bef; // approx. bytes reclaimed
}

// helpers the idiom above assumes: free heap, and an interruption-safe sleep
private static long mem() { return Runtime.getRuntime().freeMemory(); }
private static void sleep( long ms) { try { Thread.sleep( ms); } catch ( InterruptedException ie) {} }
Michael Celona wrote:
What is single-handedly the best way to improve search performance?  I have
an index in the 2G range stored on the local file system of the searcher.
Under a load test of 5 simultaneous users my average search time is ~4700
ms.  Under a load test of 10 simultaneous users my average search time is
~1 ms.  I have given the JVM 2G of memory and am using dual 3GHz
Xeons.  Any ideas?  

 

Michael




Re: Document Clustering

2005-02-08 Thread David Spencer
Owen Densmore wrote:
I would like to be able to analyze my document collection (~1200 
documents) and discover good "buckets" of categories for them.  I'm 
pretty sure this is termed Document Clustering .. finding the emergent 
clumps the documents fall naturally into judging from their term vectors.

Looking at the discussion that flared roughly a year ago (last message 
2003-11-12) with the subject Document Clustering, it seems Lucene should 
be able to help with this.  Has anyone had success with this recently?

Last year it was suggested Carrot2 could help, and it would even produce 
good labels for the clusters.  Has this proven to be true?  Our goal is 
to use clustering to build a nifty graphic interface, probably using Flash.
Carrot2 seems to work nicely.
Demo here...
Search for something like "artificial intelligence" in my Wikipedia 
Search engine:

http://www.searchmorph.com/kat/wikipedia.jsp?s=artificial+intelligence
Then click on the "see clustered results.." link to go here:
http://www.searchmorph.com/kat/wikipedia-cluster.jsp?s=artificial%20intelligence
And voilà, what seems like decent clusters.
I'm not sure what the complexity of the algorithm is, but for me ~100 
docs works ok, maybe 200, but beyond 200 you need lots more CPU and RAM.

I suggest: try it w/ ~100 docs, and if you like what you see, keep 
increasing the # of docs you give it. You might have to wait a while w/ 
all 1,200 docs...

- Dave



Thanks for any pointers.
Owen


Re: Configurable indexing of an RDBMS, has it been done before?

2005-02-07 Thread David Spencer
Nice, very similar to what I was thinking of, where the most significant 
difference is probably just that I was thinking of a batch indexer, not 
one embedded in a web container. Probably a worthwhile contribution to 
the sandbox.


Aad Nales wrote:
Yep,
This is how we do it.
We have a search.xml that maps database fields to search fields and a 
parameter part that describes the 'click for detailed result url' and 
the parameter names (based on the search fields). In this xml we also 
describe how the different fields should be stored; for instance, we have 
a number of large text fields for which we use the unstored option.

The framework that we have built around it has an element that we call a 
detailer. This detailer creates a lucene Document with the fields as 
specified in the search.xml

To illustrate here is the code that specifies the detailer for a forum.
-- XML 
-















 END XML ---

Please note:
Messageid is the key field here. When we search the index we use a 
combined TYPE + KEY id to filter out double hits on the same document 
(not unusual in, for instance, a long forum thread).

Per type of document we also specify what picture to show in the 
result (image), and we specify in what index the result should be 
written and what the general search field is (if the user submits a 
query without search all and without a field specified).

We have added the 'split' keyword which makes it possible to search a 
long text but only store a bit in the resulting hit.

The reindex is pretty straightforward: we build a series of detailers for 
all possible document types, run through the database and call the 
right detailer from a HashMap.

We have not included the JDBC stuff since the application is always 
running in Tomcat-Struts and since we cache most of the database reads. 
(a completely different story).

Queries on new and changed records seem to only make sense if asked in a 
context of time. (Right?) We have not needed it yet. The mapping can be 
queried from a singleton java class (SearchConfiguration).

We are currently adding functionality to store 'user structured data' 
best imagined as user defined input forms that are described in XML and 
are then stored as XML in the database. We query these documents using 
Lucene. These documents end up in the same index but this is quite 
manageable by using specialized detailers. For these documents the type 
is more important than for the 'normally' stored documents. For this 
latter situation the search logic assumes that the query is 
appropriately configured by the application.

I am not sure if this is the kind of solution that you are looking for, 
but everything we produce is 100% open source.

Cheers,
Aad
David Spencer wrote:
Many times I've written ad-hoc code that pulls in data from an RDBMS 
and builds a Lucene index. The use case is a typical database-driven 
dynamic website which would be a hassle to spider (say, due to tricky 
authentication).

I had a feeling this had been done in a general manner but didn't see 
any code in the sandbox, nor did any searches turn it up.

I've spent a few mins thinking this thru - what I'd expect is to be 
able to configure is:

1. JDBC Driver + conn params
2. Query to do a 1 time full index
3. Query to show new records
4. Query to show changed records
5. Query to show deleted records
6. Query columns to Lucene Field name mapping
7. "Type" of each field name (e.g. the equivalent of the args to the 
Field ctr)

So a simple example, taking item 2 is
query: "select url, name, body from foo"
(now the column to field mapping)
col 1 => url
col 2 => title
col 3 => contents
(now the field types for each named field)
url => Field( ...store=true, index=false)
title => Field( ...store=true, index=true)
contents => Field( ...store=false, index=true)

And voilà, nice, elegant, data driven indexing.
Does it exist?
Should it? :)
PS
I know in the more general form, "query" needs to be replaced by 
"queries" above, and the updated query may need some time stamp 
variable expansion, and possibly the queries need "paging" to deal w/ 
lamo DBs like mysql that don't have cursors for large result sets...




Configurable indexing of an RDBMS, has it been done before?

2005-02-07 Thread David Spencer
Many times I've written ad-hoc code that pulls in data from an RDBMS and 
builds a Lucene index. The use case is a typical database-driven dynamic 
website which would be a hassle to spider (say, due to tricky 
authentication).

I had a feeling this had been done in a general manner but didn't see 
any code in the sandbox, nor did any searches turn it up.

I've spent a few mins thinking this thru - what I'd expect is to be able 
to configure is:

1. JDBC Driver + conn params
2. Query to do a 1 time full index
3. Query to show new records
4. Query to show changed records
5. Query to show deleted records
6. Query columns to Lucene Field name mapping
7. "Type" of each field name (e.g. the equivalent of the args to the 
Field ctr)

So a simple example, taking item 2 is
query: "select url, name, body from foo"
(now the column to field mapping)
col 1 => url
col 2 => title
col 3 => contents
(now the field types for each named field)
url => Field( ...store=true, index=false)
title => Field( ...store=true, index=true)
contents => Field( ...store=false, index=true)

And voilà, nice, elegant, data driven indexing.
Does it exist?
Should it? :)
PS
 I know in the more general form, "query" needs to be replaced by 
"queries" above, and the updated query may need some time stamp variable 
expansion, and possibly the queries need "paging" to deal w/ lamo DBs 
like mysql that don't have cursors for large result sets...
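
To make it concrete, a quick sketch of the batch version (config values are 
just inlined here and obviously hypothetical; Lucene 1.4-era Field factories, 
plain JDBC):

	Class.forName( "com.mysql.jdbc.Driver");                     // item 1: driver
	Connection con = DriverManager.getConnection( jdbcUrl, user, pass);
	IndexWriter writer = new IndexWriter( "/path/to/index", new StandardAnalyzer(), true);
	Statement st = con.createStatement();
	ResultSet rs = st.executeQuery( "select url, name, body from foo"); // item 2: full index query
	while ( rs.next())
	{
		Document doc = new Document();
		doc.add( Field.UnIndexed( "url", rs.getString( 1)));     // store=true,  index=false
		doc.add( Field.Text( "title", rs.getString( 2)));        // store=true,  index=true
		doc.add( Field.UnStored( "contents", rs.getString( 3))); // store=false, index=true
		writer.addDocument( doc);
	}
	rs.close(); st.close(); con.close();
	writer.optimize();
	writer.close();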




competition - Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-02-01 Thread David Spencer
I wasn't sure where in this thread to reply so I'm replying to myself :)
What search appliances exist now?
I only found 3:
[1] Google
[2] Thunderstone
http://www.thunderstone.com/texis/site/pages/Appliance.html
[3] IndexEngines (not out yet)
http://www.indexengines.com/

--
Also, out of curiosity, do people have appliance h/w vendors they like?
These guys seem like they have nice options for pretty colors:
http://www.mbx.com/oem/index.cfm
http://www.mbx.com/oem/options/

David Spencer wrote:
This reminds me, has anyone ever discussed something similar:
- rackmount server ( or for coolness factor, that mini mac)
- web i/f for config/control
- of course the server would have the following s/w:
-- web server
-- lucene / nutch
Part of the work here I think is having a decent web i/f to configure 
the thing and to customize the L&F of the search results.


jian chen wrote:
Hi,
I was searching using google and just found that there was a new
feature called "google mini". Initially I thought it was another free
service for small companies. Then I realized that it costs quite some
money ($4,995) for the hardware and software. (I guess the proprietary
software costs a whole lot more than actual hardware.)
The "nice" feature is that, you can only index up to 50,000 documents
with this price. If you need to index more, sorry, send in the
check...
It seems to me that any small biz will be ripped off if they install
this google mini thing, compared to using Lucene to implement an easy
to use search software, which could search up to whatever number of
documents you could imagine.
I hope the lucene project could get exposed more to the enterprise so
that people know that they have not only cheaper but more importantly,
BETTER alternatives.
Jian


Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-31 Thread David Spencer
Otis Gospodnetic wrote:
Adam,
Dawid posted some code that lets you use Carrot2 locally with Lucene,
see embedded zip url here for carrot2/lucene code - it may also be in 
the carrot2 cvs tree too - this is what I used in the wikipedia/cluster 
stuff as the basis

http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html
without the componentized pipe line system described on Carrot2 site.


Otis
--- Adam Saltiel <[EMAIL PROTECTED]> wrote:

David, Hi,
Would you be able to comment on coincidentally recent thread " RE: ->
Grouping Search Results by Clustering Snippets:"?
Also, when I looked at Carrot2 the pipe line is implemented as over
http. I
wonder how efficient that is, or can it be changed, for instance for
an all
local implementation?
Has Carrot2 been integrated in with Lucene, has it been used as the
bases
for a recommender system (could it be?)?
TIA.
Adam

-Original Message-
From: Dawid Weiss [mailto:[EMAIL PROTECTED]
Sent: Monday, January 31, 2005 4:12 PM
To: Lucene Users List
Subject: Re: carrot2 question too - Re: Fun with the Wikipedia
Hi.
Coming up with answers... a little belated, but hope you're still
on:
we have been experimenting with carrot2 and are very pleased so
far,
only one issue: there is no release not even an alpha one and the
dependencies seemed to be patched (jama)
Yes, there is not "official" release. We just don't feel the need
to tag
the sources with an official label because Carrot is not a
stand-alone
product (rather a library... or a framework). It does not imply
that the
project is in alpha stage... quite the contrary, in fact -- it has
been
out there for a while and it seems to do a good job for most
people.
is there any intentions to have any releases in the near future?
I could tag a release even today if it makes you happy ;) But I
hope I
made the status of the project clear above.
D.



Re: google mini? who needs it when Lucene is there

2005-01-27 Thread David Spencer
Xiaohong Yang (Sharon) wrote:
Hi,
 
I agree that Google mini is quite expensive.  It might be similar to the desktop version in quality.  Does anyone know google's ratio of index to text?  Is it true that Lucene's index is about 500 times the original text size (not including image size)?  I don't have one installed, so I cannot measure.
500:1 for Lucene?  I don't think so.
In my wikipedia search engine the data in the MySQL DB I index from is 
approx 1.0 GB (sum of lengths of title and body), while the Lucene index 
of just these 2 fields is 250MB, thus in this case the Lucene index is 
25% of the corpus size.


 
Best,
 
Sharon

jian chen <[EMAIL PROTECTED]> wrote:
Hi,
I was searching using google and just found that there was a new
feature called "google mini". Initially I thought it was another free
service for small companies. Then I realized that it costs quite some
money ($4,995) for the hardware and software. (I guess the proprietary
software costs a whole lot more than actual hardware.)
The "nice" feature is that, you can only index up to 50,000 documents
with this price. If you need to index more, sorry, send in the
check...
It seems to me that any small biz will be ripped off if they install
this google mini thing, compared to using Lucene to implement an easy
to use search software, which could search up to whatever number of
documents you could imagine.
I hope the lucene project could get exposed more to the enterprise so
that people know that they have not only cheaper but more importantly,
BETTER alternatives.
Jian


Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread David Spencer
Jason Polites wrote:
I think everyone agrees that this would be a very neat application of 
opensource technology like Lucene... however (opens drawer, pulls out 
devil's advocate hat, places on head)... there are several complexities 
here not addressed by Lucene (et. al).  Not because Lucene isn't damn 
fantastic, just because it's not its job.

One of the big ones is security.  Enterprise search is no good if it 
doesn't match up with the authentication and authorization paradigms 
existing in the organisation.  How useful is it to return a whole bunch 
of search results for documents to which you don't have access? Not to 
mention the issues around whether you are even authorized to know it 
exists.
I was gonna mention this - you beat me to the punch.  I suspect that 
LDAP/JNDI integration is a start, but you need hooks for an arbitrary 
auth plugin. And once we address this it might be the case that a user 
has to *log in* to the search server.  We have Verity where I work and 
this is all the case, along w/ the fact that a sale seems to involve 
mandatory consulting work (not that that's bad, but if you're trying to 
ship a shrink wrapped search engine in a box then this is an issue).

The other prickly one is file types.  It's all well and good to index 
HTML, XML and text but when you start looking at PDF, MS Office (OLE 
docs, PSTs, Outlook MSG files, MS Project files etc), Lotus Notes 
databases etc etc, things begin to look less simple and far less elegant 
than a nice clean lucene rackmount.  Sure there are great projects like 
Apache POI but they still have a bit of a way to go before they 
mature to a point of really solving these problems.  After which time 
Microsoft will probably be rolling out Longhorn and everyone may need to 
start from scratch.
Also need http://jcifs.samba.org/ so you can spider windows file shares.
This is not to say that it's not a great idea, but as with most great 
ideas the challenge is not the formation of the idea, but its 
implementation.
Indeed.
I think a great first step would be to start developing good, reliable, 
opensource extensions to Lucene which strive to solve some of these issues.

end rant.
- Original Message - From: "Otis Gospodnetic" 
<[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, January 28, 2005 12:40 PM
Subject: Re: rackmount lucene/nutch - Re: google mini? who needs it when 
Lucene is there


I discuss this with myself a lot inside my head... :)
Seriously, I agree with Erik.  I think this is a business opportunity.
How many people are hating me now and going "shh"?  Raise your
hands!
Otis
--- David Spencer <[EMAIL PROTECTED]> wrote:
This reminds me, has anyone ever discussed something similar:
- rackmount server ( or for coolness factor, that mini mac)
- web i/f for config/control
- of course the server would have the following s/w:
-- web server
-- lucene / nutch
Part of the work here I think is having a decent web i/f to configure
the thing and to customize the L&F of the search results.

jian chen wrote:
> Hi,
>
> I was searching using google and just found that there was a new
> feature called "google mini". Initially I thought it was another free
> service for small companies. Then I realized that it costs quite some
> money ($4,995) for the hardware and software. (I guess the proprietary
> software costs a whole lot more than actual hardware.)
>
> The "nice" feature is that, you can only index up to 50,000 documents
> with this price. If you need to index more, sorry, send in the
> check...
>
> It seems to me that any small biz will be ripped off if they install
> this google mini thing, compared to using Lucene to implement an easy
> to use search software, which could search up to whatever number of
> documents you could imagine.
>
> I hope the lucene project could get exposed more to the enterprise so
> that people know that they have not only cheaper but more importantly,
> BETTER alternatives.
>
> Jian
>
>


rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread David Spencer
This reminds me, has anyone ever discussed something similar:
- rackmount server ( or for coolness factor, that mini mac)
- web i/f for config/control
- of course the server would have the following s/w:
-- web server
-- lucene / nutch
Part of the work here I think is having a decent web i/f to configure 
the thing and to customize the L&F of the search results.


jian chen wrote:
Hi,
I was searching using google and just found that there was a new
feature called "google mini". Initially I thought it was another free
service for small companies. Then I realized that it costs quite some
money ($4,995) for the hardware and software. (I guess the proprietary
software costs a whole lot more than actual hardware.)
The "nice" feature is that, you can only index up to 50,000 documents
with this price. If you need to index more, sorry, send in the
check...
It seems to me that any small biz will be ripped off if they install
this google mini thing, compared to using Lucene to implement an easy
to use search software, which could search up to whatever number of
documents you could imagine.
I hope the lucene project could get exposed more to the enterprise so
that people know that they have not only cheaper but more importantly,
BETTER alternatives.
Jian


Re: query term frequency

2005-01-27 Thread David Spencer
Jonathan Lasko wrote:
What do I call to get the term frequencies for terms in the Query?  I 
can't seem to find it in the Javadoc...
Do you mean the # of docs that have a term?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#docFreq(org.apache.lucene.index.Term)
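e.g. (Lucene 1.4-era API):

	IndexReader reader = IndexReader.open( "/path/to/index");
	int df = reader.docFreq( new Term( "contents", "lucene")); // # of docs containing the term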
Thanks.
Jonathan


lucenebook.com -- Re: Search on heterogenous index

2005-01-26 Thread David Spencer
Erik Hatcher wrote:
On Jan 26, 2005, at 5:44 AM, Simeon Koptelov wrote:
Heterogenous Documents/indices are OK - check out the second hit:
  http://www.lucenebook.com/search?query=heterogenous+different

Thanks, I'll consider buying "Lucene in Action".

Our master plan is working!  :)   Just kidding. I have on my TODO 
list to aggregate more Lucene-related content (like the javadocs, 
Would be nice if we could have up to date sandbox javadoc somewhere.
I've linked to some local copies of it from my page here:
http://www.searchmorph.com/pub/
Also useful would be to use the "-linksource" tag to javadoc so the 
htmlized source code is avail too, that way you have a source code search.

Maybe I should release my "javadoc" Analyzer which I use on 
searchmorph.com - it tries to do intelligent tokenization of java so 
that a word like "HashMap" becomes 3 tokens at the same offset 
('hashmap', 'hash', 'map') and which might be useful for you.

I do like the way you provide snippets from the book - nicely done.
Lucene's own documentation, perhaps a crawl of the wiki and the Lucene 
resources) into our search engine so that it becomes a richer resource 
and seems less like a marketing ploy.  Though the highlighted snippets 
do have enough information to be useful in some cases, which is nice.  I 
will start dedicating a few minutes a day to blog some useful content.

By all means, if you have other suggestions for our site, let us know at 
[EMAIL PROTECTED]

Erik


Re: WordNet code updated, now with query expansion -- Re: SYNONYM + GOOGLE

2005-01-24 Thread David Spencer
Pierrick Brihaye wrote:
Hi,
David Spencer a écrit :
One example of expansion with the synonym boost set to 0.9 is the 
query "big dog" expands to:

Interesting.
Do you plan to add expansion on other Wordnet relationships ? Hypernyms 
and hyponyms would be a good start point for thesaurus-like search, 
wouldn't it ?
Good point, I hadn't considered this - but how would it work? Just 
consider these 2 relationships "synonyms" (thus easier to use), or keep 
them separate (too academic?)?
However, I'm afraid that this kind of feature would require refactoring, 
probably based on WordNet-dedicated libraries. JWNL 
(http://jwordnet.sourceforge.net/) may be a good candidate for this.
Good point, should leverage existing code.

Thank you for your work.
thx,
 Dave
Cheers,



Re: Question about Analyzer and words spelled in different languages

2005-01-17 Thread David Spencer
markharw00d wrote:
 >>Writing this kind of an analyzer can be a bit of a hassle and the 
position increment of 0 might affect highlighting code

The highlighter in the sandbox was refactored to support these kinds of 
analyzer some time ago so it shouldn't be a problem.
Sorry - I should have noted that as I think I was the one that requested 
this change! I meant to say that position increments are tricky and an 
increment of 0 could affect all kinds of other code that doesn't know to 
look for such things.

 The JUnit test 
that comes with the highlighter includes a SynonymAnalyzer class which 
demonstrates this.

Cheers
Mark


MoreLikeThis and other similarity query generators checked in + online demo

2005-01-17 Thread David Spencer
Based on mail from Doug I wrote a "more like this" query generator, 
named, well, MoreLikeThis. Bruce Ritchie and Mark Harwood made changes 
to it (esp term vector support) and bug fixes. Thanks to everyone.

I've checked in the code to the sandbox under contributions/similarity.
The package it ends up at is org.apache.lucene.search.similar -- hope 
that makes sense.

I also created a class, SimilarityQueries, to hold other methods of 
similarity query generation. The 2 methods in there are "dumber" 
variations that use the entire source of the target doc to from a large 
query.

Javadoc is here:
http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/package-summary.html
Online demo here - this page below compares the 3 variations on 
detecting similar docs. The timing info (3 numbers w/ "(ms)") may be 
suspect. Also note if you scroll to the bottom you can see the queries 
that were generated.

Here's a page showing docs similar to the entry for Iraq:
http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Iraq
And here's one for docs similar to the one on Garry Kasparov (he knows 
how to play chess :) ):

http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Garry_Kasparov
To get to it you start here:
http://www.searchmorph.com/kat/wikipedia.jsp
And search for something - on the search results page follow a "cmp" link
http://www.searchmorph.com/kat/wikipedia.jsp?s=iraq
Make sense? Useful? Has anyone done any other variations (e.g. cosine 
measure)?

- Dave


Re: Question about Analyzer and words spelled in different languages

2005-01-17 Thread David Spencer
Mariella Di Giacomo wrote:
Hi ALL,
We are trying to index scientific articles written in English, but whose 
authors' names can be spelled in any language (depending on the author's 
nationality).

E.g.
Schäffer
In the XML document that we provide to Lucene the author name is written 
in the following way (using HTML ENTITIES)

Schäffer
So in practice that is the name that would be given to a Lucene 
analyzer/filter

Is there any already-written analyzer that would take that name 
(Schäffer or any other name that has entities) so that
the Lucene index could be searched (once the field has been indexed) for the 
real version of the name, which is

Schäffer
and the english spelled version of the name which is
Schaffer
Thanks a lot in advance for your help,
If I understand the question then I think there are 2 ways of doing it.
[1] Write a custom analyzer that uses Token.setPositionIncrement(0) to 
put alternate spellings at the same place in the token stream. This way 
phrase matches work right (so the query "Jonathan Schaffer" and 
"Jonathan Schäffer" will match the same phrase in the doc).

[2] Do not use a special analyzer - instead do query expansion, so if 
they search for "Schaffer" then the generated query is (Schaffer Schäffer).

I've used both techniques before - I use #1 w/ a "JavadocAnalyzer" on 
searchmorph.com so that if you search for "hash" you'll see matches for 
"HashMap", as "HashMap" is tokenized into 3 tokens at the same location 
('hash', 'map', 'hashmap').  Writing this kind of an analyzer can be a 
bit of a hassle and the position increment of 0 might affect 
highlighting code or other (say, summarizing) code that uses the Analyzer.

For an example of #2 see my Wordnet/Synonym query expansion example in 
the lucene sandbox. You prebuild an index of synonyms (or in your case 
maybe just rules are fine). Then you need query expansion code that 
takes "Schaffer" and expands it to something like "Schaffer 
Schäffer^0.9" (if you want to assume the user probably spells the name 
right). Simple enough to code, only hassle then is if you want to use 
the standard QueryParser...
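
A rough sketch of technique #1, as a TokenFilter that emits an unaccented 
variant at the same position (Lucene 1.4-era TokenStream API; fold() is an 
assumed helper that maps accented chars to their ASCII equivalents):

	public class AccentSynonymFilter extends TokenFilter
	{
		private Token pending; // extra token waiting to be emitted

		public AccentSynonymFilter( TokenStream in) { super( in); }

		public Token next() throws IOException
		{
			if ( pending != null) { Token t = pending; pending = null; return t; }
			Token t = input.next();
			if ( t == null) return null;
			String folded = fold( t.termText()); // e.g. "schäffer" -> "schaffer"
			if ( ! folded.equals( t.termText()))
			{
				pending = new Token( folded, t.startOffset(), t.endOffset());
				pending.setPositionIncrement( 0); // same position as the original
			}
			return t;
		}
	}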

thx,
 Dave



Mariella



Re: How to add a Lucene index to a jar file?

2005-01-17 Thread David Spencer
Doug Cutting wrote:
Miles Barr wrote:
You'll have to implement org.apache.lucene.store.Directory to load the
index from the JAR file. Take a look at FSDirectory and RAMDirectory for
some more details.
Then you have either load the JAR file with java.util.jar.JarFile to get
to the files or you can use Classloader#getResourceAsStream to get to
them.

The problem is that a jar file entry becomes an InputStream, but 
InputStream is not random access, and Lucene requires random access.  So 
you need to extract the index either to disk or RAM in order to get 
random access.  I think folks have posted code for this to the list 
Isn't "ZipDirectory" the thing to search for?
previously.
Doug
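
One hedged way to do the "extract to disk" option (standard Java APIs plus 
FSDirectory; the "index/" entry prefix inside the jar is just an assumption):

	JarFile jar = new JarFile( "myapp.jar");
	File tmp = new File( System.getProperty( "java.io.tmpdir"), "unpacked-index");
	tmp.mkdirs();
	for ( Enumeration e = jar.entries(); e.hasMoreElements();)
	{
		JarEntry entry = (JarEntry) e.nextElement();
		if ( entry.isDirectory() || ! entry.getName().startsWith( "index/")) continue;
		InputStream in = jar.getInputStream( entry);
		OutputStream out = new FileOutputStream( new File( tmp, new File( entry.getName()).getName()));
		byte[] buf = new byte[ 4096];
		for ( int n; (n = in.read( buf)) != -1;) out.write( buf, 0, n);
		out.close(); in.close();
	}
	jar.close();
	IndexSearcher searcher = new IndexSearcher( FSDirectory.getDirectory( tmp, false));

(You could then copy that into a RAMDirectory if you want it all in memory.)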


Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-17 Thread David Spencer
Dawid Weiss wrote:
Hi David,
I apologize about the delay in answering this one, Lucene is a busy 
mailing list and I had a hectic last week... Again, sorry for the belated 
answer, hope you still find it useful.
Oh no problem, and yes carrot2 is useful and fun.  It's a rich package 
so it takes a while to understand all that it can do.

That is awesome and very inspirational!

Yes, I admit what you've done with Wikipedia is quite interesting and 
looks very good. I'm also glad you spent some time working out Carrot 
integration with Lucene. It works quite nice.
Thanks but I just took code that I think you wrote(!) and made minor 
mods to it - here's one link:
http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html

I'd like to do more w/ Carrot2- that's where things get harder.

Carrot2 looks very interesting. Wondering if anybody has a list of 
all the

Technically I don't think carrot2 uses lucene per-se- it's just that 
you can integrate the two, and ditto for Nutch - it has code that uses 
Carrot2.

Yes, this is true. Carrot2 doesn't use all of Lucene's potential -- it 
merely takes the output from a query (titles, urls and snippets) and 
attempts to cluster them into some sensible groups. I think many things 
could be improved, the most important of them is fast snippet retrieval 
  from Lucene because right now it takes 50% of the time of the 
clustering; I've seen a post a while ago describing a faster snippet 
generation technique, I'm sure that would give clustering a huge boost 
speed-wise.

And here's my question. I reread the Carrot2<->Lucene code, esp 
Demo.java, and there's this fragment:

// warm-up round (stemmer tables must be read etc).
List clusters = clusterer.clusterHits(docs);
long clusteringStartTime = System.currentTimeMillis();
clusters = clusterer.clusterHits(docs);
long clusteringEndTime = System.currentTimeMillis();
Thus it calls clusterHits() twice.
I don't really understand how to use Carrot2 - but I think the above 
is just for the sake of benchmarking clusterHits() w/o the effect of 
1-time initialization - and that there's no benefit of repeatedly 
calling clusterHits (where a benefit might be that it can find nested 
clusters or whatever) - is that right (that there's no benefit)?

No, there is absolutely no benefit from it. It was merely to show people 
that the clustering needs to be warmed up a bit. I should not have put 
it in the code knowing people would be confused by it. You can safely 
use clusterHits just once. It will just have a small delay at the first 
invocation.

Thanks for experimenting. Please BCC me if you have any urgent projects 
-- I read Lucene's list in batches and my personal e-mail I try to keep 
up to date with.

Dawid
thx,
 Dave


Re: How to get the "context" of the searched word ?

2005-01-16 Thread David Spencer
Eternal Security wrote:
Hello all
When I search for a word in documents, I also want to get the "context" of the 
word, for example the 10 words before and after,
like in search engines such as google or yahoo.
So I have tried to use the Highlight module, but Highlight needs to get a 
String of the contents.
I use this function:
highlighter.highlightText(myString);
myString must be the contents of the whole document!
But of course I don't want to store all the contents in the index files!
How can I do this?
I'm sure I'm not the first lucene user who wants to get the "context" of the 
searched word.
Do I need to use Highlight, or do you know a better way?
This is just the way it is - you can always compress the content. The only 
alternative is enabling term vectors for the body field - then - (um..I 
think) you don't need the entire content separately - instead it comes 
from the Lucene index.
To store the term vector the 3rd arg here will be true.
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.lang.String,%20boolean)
I believe the index size will increase, but I haven't measured it.
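
e.g. at indexing time (Lucene 1.4-era API; "contents" and bodyText are just 
placeholder names):

	doc.add( Field.Text( "contents", bodyText, true)); // 3rd arg: store the term vector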


Thanks in advance.


carrot2 question too - Re: Fun with the Wikipedia

2005-01-14 Thread David Spencer
[EMAIL PROTECTED] wrote:
That is awesome and very inspirational!
Thank you.
Carrot2 looks very interesting. Wondering if anybody has a list of all the
Technically I don't think carrot2 uses lucene per-se- it's just that you 
can integrate the two, and ditto for Nutch - it has code that uses Carrot2.

This post is where the code I used as a basis came from:
http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html
This is URL w/ the code:
http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip
And here's my question. I reread the Carrot2<->Lucene code, esp 
Demo.java, and there's this fragment:

// warm-up round (stemmer tables must be read etc).
List clusters = clusterer.clusterHits(docs);
long clusteringStartTime = System.currentTimeMillis();
clusters = clusterer.clusterHits(docs);
long clusteringEndTime = System.currentTimeMillis();
Thus it calls clusterHits() twice.
I don't really understand how to use Carrot2 - but I think the above is 
just for the sake of benchmarking clusterHits() w/o the effect of 1-time 
initialization - and that there's no benefit of repeatedly calling 
clusterHits (where a benefit might be that it can find nested clusters 
or whatever) - is that right (that there's no benefit)?


academic research projects using Lucene. The only other one that I know of
is Striver - which uses a support vector machine to learn the ranking
function: http://www.cs.cornell.edu/People/tj/career/
Could always search citeseer for mentions of Lucene too.
Aneesha

For my own amusement I've indexed the Wikipedia and put up pages that:
- display search results
- cluster the results using Carrot2 (my first use of this)
- display similar pages using the entire text to re-query for similar
docs and
- display similar pages using the "more like this" algorithm (TBD is get
this into the sandbox, sorry for delays..)
You start off here to search:
http://www.searchmorph.com/kat/wikipedia.jsp
And the weblog entry goes into a bit more detail:
http://www.searchmorph.com/weblog/index.php?id=37

It's kinda fun to explore the Wikipedia by looking for pages similar to
other ones.
Hope people find this useful...
- Dave
PS
  I'm in the process of running the page rank algorithm (from
jung.sf.net) on most of the entries in the Wikipedia. It has taken over
2 days so far


Fun with the Wikipedia

2005-01-14 Thread David Spencer
For my own amusement I've indexed the Wikipedia and put up pages that:
- display search results
- cluster the results using Carrot2 (my first use of this)
- display similar pages using the entire text to re-query for similar 
docs and
- display similar pages using the "more like this" algorithm (TBD is get 
this into the sandbox, sorry for delays..)

You start off here to search:
http://www.searchmorph.com/kat/wikipedia.jsp
And the weblog entry goes into a bit more detail:
http://www.searchmorph.com/weblog/index.php?id=37

It's kinda fun to explore the Wikipedia by looking for pages similar to 
other ones.

Hope people find this useful...
- Dave
PS
  I'm in the process of running the page rank algorithm (from 
jung.sf.net) on most of the entries in the Wikipedia. It has taken over 
2 days so far



Re: full text as input ?

2005-01-13 Thread David Spencer
Hunter Peress wrote:
Is it efficient and feasible to use lucene to do full text
comparisons? E.g. take an entire text that's reasonably large (e.g.
more than 10 words) and find the result set within the lucene search
index that is statistically similar to all the text.
I do this kind of stuff all the time, no problem.
I think this came up a month ago - probably appears monthly.
For another variation search for "MoreLikeThis" in the list - it's code 
I mailed in that I haven't, yet, checked in.

Anyway, if you want to search for docs that are similar to a source 
document, you can call this method to generate a similarity query.

'srch' is the source doc
'a' is your analyzer
'field' is the field that stores the body e.g. "contents"
'stop' is an opt Set of stop words to ignore as an optimization - it's 
not needed if the Analyzer ignores stop words, but if you keep stop 
words you might still want to ignore them in this kind of query as they 
probably won't help

	public static Query formSimilarQuery( String srch, Analyzer a, String field, Set stop)
		throws org.apache.lucene.queryParser.ParseException, IOException
	{
		TokenStream ts = a.tokenStream( field, new StringReader( srch));
		org.apache.lucene.analysis.Token t;
		BooleanQuery tmp = new BooleanQuery();
		Set already = new HashSet(); // dedupe: each term goes into the query only once
		while ( (t = ts.next()) != null)
		{
			String word = t.termText();
			if ( stop != null && stop.contains( word)) continue;
			if ( ! already.add( word)) continue;
			TermQuery tq = new TermQuery( new Term( field, word));
			tmp.add( tq, false, false); // OR'd in: not required, not prohibited
		}

		// tbd, from lucene in action book
		// https://secure.manning.com/catalog/view.php?book=hatcher2&item=source
		// exclude myself
		//likeThisQuery.add(new TermQuery(
		//	new Term("isbn", doc.get("isbn"))), false, true);
		return tmp;
	}
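
Typical usage is then just (a sketch - "contents" is the assumed body field):

	Query like = formSimilarQuery( srcText, new StandardAnalyzer(), "contents", null);
	Hits hits = searcher.search( like);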


stop words and index size

2005-01-13 Thread David Spencer
Does anyone know how much stop words are supposed to affect the index size?
I did an experiment of building an index once with, and once without, 
stop words.

The corpus is the English Wikipedia, and I indexed the title and body of 
the articles. I used a list of 525 stop words.

With stopwords removed the index is 227MB.
With stopwords kept the index is 331MB.
Thus, the index grows by 45% in this case, which I found surprising, as I 
expected it to not grow as much. I haven't dug into the details of the 
Lucene file formats but thought compression (field/term vector/sparse 
lists/vints) would negate the effect of stopwords to a large extent.

Some more details + a link to my stopword list are here:
http://www.searchmorph.com/weblog/index.php?id=36
-- Dave
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


WordNet code updated, now with query expansion -- Re: SYNONYM + GOOGLE

2005-01-11 Thread David Spencer
Erik Hatcher wrote:
On Jan 10, 2005, at 6:54 PM, David Spencer wrote:
Hi...I wrote the WordNet sandbox code - but I'm not sure if I 
understand this thread. Are we saying that it does not work w/ the new 
WordNet data, or that the code in Erik's book is better/more up to date, etc.?

I have not tried the sandbox with any versions past WordNet 1.6.  
Karthik shows a Java API to it, which I have not used - only your code 
that parses the prolog files.  So the book code explains exactly what is 
in the sandbox and describes WordNet 1.6 integration.  Though WordNet 
has evolved.

If needed I can update the sandbox code..

It'd be awesome to have current WordNet support - I haven't looked at 
what is involved in making it so.

I verified that the code works w/ the latest WordNet (2.0), and it does 
so, no problem. The relevant data from WordNet has not changed so 
there's no need to upgrade WordNet for this package at least.

I added "query expansion" which takes in a simple query string and for 
every term adds their synonyms. There's an optional boost parameter to 
be used to "penalize" synonyms if you want to use the heuristic that the 
 user probably knows the right word.

One example of expansion with the synonym boost set to 0.9 is the query 
"big dog" expands to:

big adult^0.9 bad^0.9 bighearted^0.9 boastful^0.9 boastfully^0.9 
bounteous^0.9 bountiful^0.9 braggy^0.9 crowing^0.9 freehanded^0.9 
giving^0.9 grown^0.9 grownup^0.9 handsome^0.9 large^0.9 liberal^0.9 
magnanimous^0.9 momentous^0.9 openhanded^0.9 prominent^0.9 swelled^0.9 
vainglorious^0.9 vauntingly^0.9
 dog andiron^0.9 blackguard^0.9 bounder^0.9 cad^0.9 chase^0.9 click^0.9 
detent^0.9 dogtooth^0.9 firedog^0.9 frank^0.9 frankfurter^0.9 frump^0.9 
heel^0.9 hotdog^0.9 hound^0.9 pawl^0.9 tag^0.9 tail^0.9 track^0.9 
trail^0.9 weenie^0.9 wiener^0.9 wienerwurst^0.9

Amusingly then, documents with the terms "liberal wienerwurst" match 
"big dog"! :)

Javadoc is here:
http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/WordNet/build/docs/api/org/apache/lucene/wordnet/package-summary.html
The new query expansion is here:
http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/WordNet/build/docs/api/org/apache/lucene/wordnet/SynExpand.html
Want to try it out? This page *expands* a query and prints out the 
result (but doesn't execute it yet).
http://www.searchmorph.com/kat/synonym.jsp?syn=big

CVS tree here:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/WordNet/
If you just want to use a prebuild index it's here (1MB):
http://searchmorph.com/pub/syn_index.zip
The prebuilt jar file is here:
http://www.searchmorph.com/pub/lucene-wordnet-dev.jar
Redundant weblog entry here:
http://www.searchmorph.com/weblog/index.php?id=34
Hope y'all like it and someone finds it useful,
  Dave
PS
 Oh - it may need the 1.5 dev branch of Lucene to work - I'm not 
positive but it I tried to remove deprecated warnings and doing so may 
have tied it to the latest code...

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: SYNONYM + GOOGLE

2005-01-10 Thread David Spencer
Erik Hatcher wrote:
Karthik,
Thanks for that info.  I knew I was behind the times with WordNet using  
the sandbox code, but it was good enough for my purposes at the time.   
I will definitely try out the latest WordNet offerings in the future  
Hi...I wrote the WordNet sandbox code - but I'm not sure if I understand 
this thread. Are we saying that it does not work w/ the new WordNet 
data, or that the code in Erik's book is better/more up to date, etc.?

If needed I can update the sandbox code..
thx,
 Dave

though.
Erik
On Jan 10, 2005, at 7:37 AM, Karthik N S wrote:
Hi Erik
Apologies...
I may be a little off-topic for this forum, but I may be able to help with the next
version of Lucene in Action.
I was working with the Java WordNet Library (JWNL). On fiddling with its APIs, I
found something interesting:

the code attached to this gets more synonyms than the WordNet indexed
format available from the Lucene in Action zip file.


1) It needs WordNet 2.0's dictionary installed
2) jwnl.jar from SourceForge
[
http://sourceforge.net/project/showfiles.php? 
group_id=33824&package_id=33975
&release_id=196864 ]

After successful compilation,
type "watch":
ORIGINAL  : "watch" OR "analog_watch" OR "digital_watch" OR "hunter" OR
"hunting_watch" OR "pendulum_watch" OR
"pocket_watch" OR "stem-winder" OR "wristwatch" OR  
"wrist_watch"

FORMATTED : "watch" OR "analog watch" OR "digital watch" OR "hunter" OR
"hunting watch" OR "pendulum watch" OR "pocket watch"
Check this out, maybe you will come up with brilliant ideas.

with regards
Karthik
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, January 10, 2005 5:19 PM
To: Lucene Users List
Subject: Re: SYNONYM + GOOGLE

On Jan 10, 2005, at 5:33 AM, Karthik N S wrote:
If you search Google using '~shoes', it returns hits based on the
synonyms.
[ I know there is a Synonym Wordnet  based Lucene Package in the
sandbox
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/
contributions/WordN
et/   ]
Can this be achieved in Lucene? If so, how?

Yes, it can be achieved.  Not quite synonyms, but various forms of the
same word can be found in this example, like this search for similar
(see the highlighted variations):
http://www.lucenebook.com/search?query=similar
This is accomplished using the Snowball stemmer filter found in the
sandbox.   For synonyms, you have lots of options.  In Lucene in Action
I demonstrate custom analyzers that inject synonyms using the WordNet
database (from the sandbox).  From the source code distribution of LIA:
% ant SynonymAnalyzerViewer
Buildfile: build.xml
SynonymAnalyzerViewer:
  [echo]
  [echo]   Using a custom SynonymAnalyzer, two fixed strings  are
  [echo]   analyzed with the results displayed.  Synonyms, from
the
  [echo]   WordNet database, are injected into the same  
positions
  [echo]   as the original words.
  [echo]
  [echo]   See the "Analysis" chapter for more on synonym
injection and
  [echo]   position increments.  The "Tools and extensions"
chapter covers
  [echo]   the WordNet feature found in the Lucene sandbox.
  [echo]
 [input] Press return to continue...

  [echo] Running lia.analysis.synonym.SynonymAnalyzerViewer...
  [java] 1: [quick] [warm] [straightaway] [spry] [speedy] [ready]
[quickly] [promptly] [prompt] [nimble] [immediate] [flying] [fast]
[agile]
  [java] 2: [brown] [brownness] [brownish]
  [java] 3: [fox] [trick] [throw] [slyboots] [fuddle] [fob]  [dodger]
[discombobulate] [confuse] [confound] [befuddle] [bedevil]
  [java] 4: [jumps]
  [java] 5: [over] [o] [across]
  [java] 6: [lazy] [faineant] [indolent] [otiose] [slothful]
  [java] 7: [dogs]
...
The phrase analyzed was "The quick brown fox jumps over the lazy dogs".
  Why no synonyms for "jumps" and "dogs"?  WordNet has synonyms for
"jump" and "dog", but not the plural forms.  Stemming would be a
necessary step in achieving full synonym look-up, though this would
need to be done carefully as the stem of a word is not necessarily a
real word itself - so you'd probably want to stem the synonym database
also to ensure accurate lookup.
Also notice the semantically incorrect synonyms that appear for the
animal fox ("confuse", for example).  Be careful!  :)
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: How do you handle dynamic html pages?

2005-01-10 Thread David Spencer
Kevin L. Cobb wrote:
I don't like to periodically re-index everything because 1) you can't be
confident that your searches are as up to date as they could be, and 2)
you are wasting cycles either checking for documents that may or may not
need to be updated, or re-indexing documents that don't need updated. 
And what I've noticed is that typically systems with dynamic content 
that comes from, say, a database, do not implement the HTTP "HEAD" verb 
nor "if-modified-since", which a smart spider might try to use to be 
more efficient. Thus an "incremental" spider run can be just as 
expensive as the first one.

Ideally, I think that you want an event driven system where the content
management system or the like indicates to your searcher engine when a
page/document gets updated. That way, you know that documents are as up
to date as possible in terms of searches, and you know that you aren't
doing unnecessary work. 

 

-Original Message-
From: Luke Francl [mailto:[EMAIL PROTECTED] 
Sent: Monday, January 10, 2005 11:09 AM
To: Lucene Users List
Subject: Re: How do you handle dynamic html pages?

On Mon, 2005-01-10 at 10:03, Jim Lynch wrote:
How is anyone managing reindexing of pages that change?  Just 
periodically reindex everything or do you try to determine frequency
of 

each changes to each page and/or site? 

If you are using a CMS, your best bet is to integrate Lucene with the
CMS's content update mechanism. That way, your index will always be
up-to-date.
Otherwise, I would say reindexing everything is easiest, provided it
doesn't take too long. If it's ~15 minutes or less, you could schedule a
processes to do it at a low activity period (2 AM or whenever) every day
and that would probably handle your needs.
Regards,
Luke Francl
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Query based stemming

2005-01-07 Thread David Spencer
Jim Lynch wrote:
 From what I've read, if you want to have a choice, the easiest way is 
to index the documents twice. Once with stemming on and once with it off 
placing the results in two different indexes.  Then at query time, 
select which index you want to use based on whether you want stemming on 
or off.
IMHO keeping the data in the same index is easiest.
PerFieldAnalyzerWrapper is part of the magic...approximate usage follows from 
my code below. The second piece of magic is to call doc.add(...) multiple times, 
"redundantly".

Don't use the code below exactly, however - things like MySnowballStopAnalyzer 
should become SnowballAnalyzer in your code...

Analyzer getAnalyzer()
{
    Analyzer snowball = new MySnowballStopAnalyzer();
    Analyzer def = new AlphaNumStopAnalyzer();  // prob StandardAnalyzer for most people..
    PerFieldAnalyzerWrapper fa = new PerFieldAnalyzerWrapper( def);
    fa.addAnalyzer( "scontents", snowball);  // the "s" in "scontents" is for stemming
    fa.addAnalyzer( "stitle", snowball);
    return fa;
}

...
later:

Document doc = new Document();
doc.add( Field.Text( "title", title));
doc.add( Field.Text( "stitle", new StringReader( title))); // don't need recall
String body = ...;
doc.add( Field.Text( "contents", new StringReader( body), true)); // term vector
doc.add( Field.Text( "scontents", new StringReader( body)));
writer.addDocument( doc);
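
At query time you then choose stemming per query simply by picking which field (and thus which analyzer) to search - a rough sketch, reusing the getAnalyzer() wrapper above:

// stemmed=true searches the snowball-analyzed "scontents" field,
// stemmed=false searches the plain "contents" field.
Hits searchWithOptionalStemming( Searcher searcher, String userQuery, boolean stemmed)
    throws IOException, org.apache.lucene.queryParser.ParseException
{
    String field = stemmed ? "scontents" : "contents";
    // PerFieldAnalyzerWrapper picks the right analyzer for whichever field we parse against
    Query q = QueryParser.parse( userQuery, field, getAnalyzer());
    return searcher.search( q);
}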


Jim.
Peter Kim wrote:
Hi,
I'm new to Lucene, so I apologize if this issue has been discussed
before (I'm sure it has), but I had a hard time finding an answer using
google. (Maybe this would be a good candidate for the FAQ!) :)
Is it possible to enable stem queries on a per-query basis? It doesn't
seem to be possible since the stem tokenizing is done during the
indexing process. Are people basically stuck with having all their
queries stemmed or none at all?
Thanks!
Peter
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Quick question about highlighting.

2005-01-07 Thread David Spencer
Jim Lynch wrote:
I've read as much as I could find on the highlighting that is now in the 
sandbox.  I didn't find the javadocs.
I have a copy here:
http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/highlighter/build/docs/api/overview-summary.html
  I found a link to them, but it
redirected my to a cvs tree.
Do I assume that you have to store the content of the document for the 
highlighting to work?  
Not per se, but you do need access to the contents to pass to 
Highlighter.getBestFragments(). You can store the contents in the index, 
or you can have them in a cache or DB, or you can refetch the doc...

You also need to know which Analyzer you used, to get the token stream via:
TokenStream tokenStream = analyzer.tokenStream( field, new StringReader( body));
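
Putting it together, a minimal sketch of the highlighting call (the field and variable names are illustrative; the getBestFragments() signature is the one from the sandbox javadoc linked above):

// 'query' is the Query the user searched with, 'body' is the document text,
// 'analyzer' is the same analyzer used at index time for this field.
Highlighter highlighter = new Highlighter( new QueryScorer( query));
TokenStream tokenStream = analyzer.tokenStream( "contents", new StringReader( body));
String fragments = highlighter.getBestFragments( tokenStream, body, 3, "...");
System.out.println( fragments);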


Otherwise I don't see how it could work.
Thanks,
Jim.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


google suggest / incremental search - Re: Lucene appreciation

2004-12-17 Thread David Spencer
Rony Kahan wrote:
Thanks for feedback.
PA - Since rss readers usually visit at least once per day, we only show
jobs from past few days. This allows us to use a smaller, faster index for
traffic intensive rss searching.
Ben & Praveen - Thanks for the UI suggestions. Hope to have that %3A & %22
cleared up shortly. I saw in the Lucene sandbox a "Javascript Query
Constructor". Does anyone have any experience using this to make a google
I've done 2 versions, this one "closer" to google:
http://www.searchmorph.com/kat/isearch2.jsp
And this "v1" w/ frames, thus not the same but a step:
http://www.searchmorph.com/kat/isearch.html
SearchMorph is mainly a lucene index of javadoc-generated pages from OSS 
projects..


like advanced search page? Regarding open source technology - we also use
Carrot: http://sourceforge.net/projects/carrot2/
Kind Regards,
Rony

-Original Message-
From: petite_abeille [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 16, 2004 4:29 PM
To: Lucene Users List
Subject: Re: Lucene appreciation

On Dec 16, 2004, at 17:26, Rony Kahan wrote:

If you are interested in Lucene work you can set up an rss feed or
email alert from here: 
http://www.indeed.com/search?q=lucene&sort=date
Looks great :)
One thing though, the web search returns 14 hits for the above query.
Using the RSS feed only returns 4 of them. What gives?
Cheers,
PA.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: TFIDF Implementation

2004-12-15 Thread David Spencer
Christoph Kiefer wrote:
David, Bruce, Otis,
Thank you all for the quick replies. I looked through the BooksLikeThis
example. I also agree, it's a very good and effective way to find
similar docs in the index. Nevertheless, what I need is really a
similarity matrix holding all TF*IDF values. For illustration I wrote a
quick-and-dirty class to perform that task. It uses the Jama.Matrix
class to represent the similarity matrix at the moment. For show and
tell I attached it to this email.
Unfortunately it doesn't perform very well. My index stores about 25000
docs with a total of 75000 terms. The similarity matrix is very sparse
but nevertheless needs about 1'875'000'000 entries!!! I think this
current implementation will not be usable in this way. I also think I'll
switch to JMP (http://www.math.uib.no/~bjornoh/mtj/) for that reason.
What do you think?
I don't have any deep thoughts, just a few questions/ideas...
[1] TFIDFMatrix, FeatureVectorSimilarityMeasure, and CosineMeasure are 
your classes right, which are not in the mail, but presumably the source 
isn't needed.

[2] Does the problem boil down to this line and the memory usage?
double [][] TFIDFMatrix = new double[numberOfTerms][numberOfDocuments];
Thus using a sparse matrix would be a win, and so would using floats 
instead of doubles?

[3] Prob minor, but in getTFIDFMatrix() you might be able to ignore stop 
words, as you do so later in getSimilarity().

[4] You can also consider using Colt possibly even JUNG:
http://www-itg.lbl.gov/~hoschek/colt/api/cern/colt/matrix/impl/SparseDoubleMatrix2D.html
http://jung.sourceforge.net/doc/api/index.html
[5]
Related to #2, can you precalc the matrix and store it on disk, or is 
your index too dynamic?

[6] Also, in similar kinds of calculations I've seen code that filters 
out low frequency terms e.g. ignore all terms that don't occur in at 
least 5 docs.
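
To make [2] and [6] concrete, here is a rough sketch of building sparse per-document term-weight maps straight from the index - the tf/idf formulas are written out by hand (same shape as Lucene's defaults) rather than calling Similarity, and the names are illustrative, not the attached class:

// Returns a Map: doc id (Integer) -> Map of term text (String) -> Float tf*idf weight.
// Only terms from 'field' that appear in at least 'minDocFreq' documents are kept.
public static Map sparseTfIdf( IndexReader reader, String field, int minDocFreq)
    throws IOException
{
    Map docVectors = new HashMap();
    int numDocs = reader.numDocs();
    TermEnum terms = reader.terms( new Term( field, ""));
    try
    {
        do
        {
            Term term = terms.term();
            if ( term == null || ! term.field().equals( field)) break;
            int df = terms.docFreq();
            if ( df >= minDocFreq)
            {
                // same shape as the default Lucene formulas: tf = sqrt(freq), idf = 1 + ln(N/(df+1))
                float idf = (float) ( 1.0 + Math.log( (double) numDocs / (double) ( df + 1)));
                TermDocs td = reader.termDocs( term);
                while ( td.next())
                {
                    float weight = (float) Math.sqrt( td.freq()) * idf;
                    Integer doc = new Integer( td.doc());
                    Map vec = (Map) docVectors.get( doc);
                    if ( vec == null) docVectors.put( doc, vec = new HashMap());
                    vec.put( term.text(), new Float( weight));
                }
                td.close();
            }
        } while ( terms.next());
    }
    finally
    {
        terms.close();
    }
    return docVectors;
}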

-- Dave

Best,
Christoph


/*
 * Created on Dec 14, 2004
 */
package ch.unizh.ifi.ddis.simpack.measure.featurevectors;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import Jama.Matrix;
/**
 * @author Christoph Kiefer
 */
public class TFIDF_Lucene extends FeatureVectorSimilarityMeasure {

private File indexDir = null;
private File dataDir = null;
private String target = "";
private String query = "";
private int targetDocumentNumber = -1;
private final String ME = this.getClass().getName();
private int fileCounter = 0;

public TFIDF_Lucene( String indexDir, String dataDir, String target, 
String query ) {
this.indexDir = new File(indexDir);
this.dataDir = new File(dataDir);
this.target = target;
this.query = query;
}

public String getName() {
return "TFIDF_Lucene_Similarity_Measure";
}

private void makeIndex() {
try {
IndexWriter writer = new IndexWriter(indexDir, new 
SnowballAnalyzer( "English", StopAnalyzer.ENGLISH_STOP_WORDS ), false);
indexDirectory(writer, dataDir);
writer.optimize();
writer.close();
} catch (Exception ex) {
ex.printStackTrace();
}
}

private void indexDirectory(IndexWriter writer, File dir) {
File[] files = dir.listFiles();
for (int i=0; i < files.length; i++) {
File f = files[i];
if (f.isDirectory()) {
indexDirectory(writer, f);  // recurse
} else if (f.getName().endsWith(".txt")) {
indexFile(writer, f);
}
}
}

private void indexFile(IndexWriter writer, File f) {
try {
System.out.println( "Indexing " + f.getName() + ", " + 
(fileCounter++) );
String name = f.getCanonicalPath();
//System.out.println(name);
Document doc = new Document();
doc.add( Field.Text( "contents", new FileReader(f), 
true ) );
writer.addDocument( doc );
 

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote:
 

You can also see 'Books like this' example from here 

https://secure.manning.com/catalog/view.php?book=hatcher2&item=source
Well done, uses a term vector, instead of reparsing the orig 
doc, to form the similarity query. Also I like the way you 
exclude the source doc in the query, I didn't think of doing 
that in my code.

I agree, it's a good way to exclude the source doc.
 

I don't trust calling vector.size() and vector.getTerms() 
within the loop but I haven't looked at the code to see if it 
calculates  the results each time or caches them...

From the code I looked at, those calls don't recalculate on every call. 
I was referring to this fragment below from BooksLikeThis.docsLike(), 
and was mentioning it as the javadoc 
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html 
does not say that the values returned by size() and getTerms() are 
cached, and while the impl may cache them (haven't checked) it's not 
guaranteed, thus it's safer to put the size() and getTerms() call 
outside the loop.

 for (int j = 0; j < vector.size(); j++) {
  TermQuery tq = new TermQuery(
  new Term("subject", vector.getTerms()[j]));

Regards,
Bruce Ritchie
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote:
From the code I looked at, those calls don't recalculate on 
every call. 

I was referring to this fragment below from BooksLikeThis.docsLike(), 
and was mentioning it as the javadoc 
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/in
dex/TermFreqVector.html 
does not say that the values returned by size() and getTerms() are 
cached, and while the impl may cache them (haven't checked) it's not 
guaranteed, thus it's safer to put the size() and getTerms() call 
outside the loop.

 for (int j = 0; j < vector.size(); j++) {
  TermQuery tq = new TermQuery(
  new Term("subject", vector.getTerms()[j]));

I agree with your overall point that it's probably best to put those calls outside of the loop; I was just saying that I did look at the implementation and the calls do not recalculate anything. I'm sorry I didn't explain myself clearly enough.
Oh oh oh, sorry, 10-4, no prob.

Regards,
Bruce Ritchie
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Otis Gospodnetic wrote:
You can also see 'Books like this' example from here
https://secure.manning.com/catalog/view.php?book=hatcher2&item=source
Well done, uses a term vector, instead of reparsing the orig doc, to 
form the similarity query. Also I like the way you exclude the source 
doc in the query, I didn't think of doing that in my code.

I don't trust calling vector.size() and vector.getTerms() within the 
loop but I haven't looked at the code to see if it calculates  the 
results each time or caches them...
Otis
--- Bruce Ritchie <[EMAIL PROTECTED]> wrote:

Christoph,
I'm not entirely certain if this is what you want, but a while back
David Spencer did code up a 'More Like This' class which can be used
for generating similarities between documents. I can't seem to find
this class in the sandbox so I've attached it here. Just repackage
and test.
Regards,
Bruce Ritchie
http://www.jivesoftware.com/   


-Original Message-
From: Christoph Kiefer [mailto:[EMAIL PROTECTED] 
Sent: December 14, 2004 11:45 AM
To: Lucene Users List
Subject: TFIDF Implementation

Hi,
My current task/problem is the following: I need to implement 
TFIDF document term ranking using Jakarta Lucene to compute a 
similarity rank between arbitrary documents in the constructed
index.
I saw from the API that there are similar functions already 
implemented in the class Similarity and DefaultSimilarity but 
I don't know exactly how to use them. At the time my index 
has about 25000 (small) documents and there are about 75000 
terms stored in total.
Now, my question is simple. Does anybody has done this before 
or could point me to another location for help?

Thanks for any help in advance.
Christoph 

--
Christoph Kiefer
Department of Informatics, University of Zurich
Office: Uni Irchel 27-K-32
Phone:  +41 (0) 44 / 635 67 26
Email:  [EMAIL PROTECTED]
Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [RFE] IndexWriter.updateDocument()

2004-12-14 Thread David Spencer
petite_abeille wrote:
Well, the subject says it all...
If there is one thing which is overly cumbersome in Lucene, it's 
updating documents, therefore this Request For Enhancement:

Please consider enhancing the IndexWriter API to include an 
updateDocument(...) method to take care of all the gory details involved 
in such operation.
I agree, this is always a hassle to do right due to having to use 
IndexWriter and IndexReader and properly opening/closing them.

I have a prelim version of a "batched index writer" that I use. The code 
is kinda messy, but for discussion here's what it does:

Briefly the methods are:
// [1]
// the ctr has parameters:
//'batch size # docs' e.g. it will flush pending updates every 100 docs
//'batch freq' e.g. auto flush every 60 sec
// [2]
// queue a document to be added to the index
// 'key' is the primary key name e.g. "url"
// 'val' is the primary key val e.g. "http://www.tropo.com/";
// 'doc' is the doc to be added
update( String key, String val, Document doc)
// [3]
//  queue a document for removal
// 'key' and 'val' are the params, as from [2]
remove( String key, String val)
// [4]
// periodic flush, called automatically or on demand; two phases (deletes, then adds):
// 1. call IndexReader.delete() on all pending (key,val) pairs
// 2. close the IndexReader
// 3. call IndexWriter.addDocument() on all pending documents
// 4. optionally call optimize()
// 5. close the IndexWriter
flush()

//
So in normal usage you just keep calling update() and it periodically 
flushes the pending updates to the index. By its nature this uses memory; 
however, it's tunable as to how many documents it'll queue in memory.

Does the algorithm above, esp flush(), sound correct? It seems to work 
right for me and I can post this if people want to see it...
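
For reference, a minimal sketch of that flush() sequence (assuming 'pendingDeletes' holds Terms built from the queued (key,val) pairs and 'pendingAdds' holds the queued Documents - names are illustrative):

synchronized void flush( String indexPath, Analyzer analyzer, boolean doOptimize)
    throws IOException
{
    // phase 1: apply deletions with an IndexReader, then close it
    IndexReader reader = IndexReader.open( indexPath);
    for ( Iterator it = pendingDeletes.iterator(); it.hasNext();)
        reader.delete( (Term) it.next()); // e.g. new Term("url", "http://www.tropo.com/")
    reader.close();

    // phase 2: apply additions with an IndexWriter, then close it
    IndexWriter writer = new IndexWriter( indexPath, analyzer, false); // false = don't recreate
    for ( Iterator it = pendingAdds.iterator(); it.hasNext();)
        writer.addDocument( (Document) it.next());
    if ( doOptimize) writer.optimize();
    writer.close();

    pendingDeletes.clear();
    pendingAdds.clear();
}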

- Dave

Thanks in advance.
Cheers,
PA.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote:
Christoph,
I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox 
Uh oh, sorry, I'll try to get this checked in soonish. For me it's 
always one thing to do a prelim version of a piece of code, but another 
matter to get it correctly packaged.

so I've attached it here. Just repackage and test.

An alternate approach to find "similar" docs is to use all (possibly 
unique) tokens in the  source doc to form a large query. This is code I use:

'srch' is the entire untokenized text of the source doc
'a' is the analyzer you want to use
'field' is the field you want to search on e.g. "contents" or "body"
'stop' is an optional set of stop words to ignore
It returns a query, which you then use to search for "similar" docs, and 
then in the return result you need to make sure you ignore the source 
doc, which will prob come back 1st. You can use stemming, synonyms, or 
fuzzy expansion for each term too.

public static Query formSimilarQuery( String srch, Analyzer a,
                                      String field, Set stop)
    throws org.apache.lucene.queryParser.ParseException, IOException
{
    TokenStream ts = a.tokenStream( "foo", new StringReader( srch));
    org.apache.lucene.analysis.Token t;
    BooleanQuery tmp = new BooleanQuery();
    Set already = new HashSet();
    while ( (t = ts.next()) != null)
    {
        String word = t.termText();
        if ( stop != null && stop.contains( word)) continue;
        if ( ! already.add( word)) continue;
        TermQuery tq = new TermQuery( new Term( field, word));
        tmp.add( tq, false, false);
    }
    return tmp;
}

Regards,
Bruce Ritchie
http://www.jivesoftware.com/   


-Original Message-
From: Christoph Kiefer [mailto:[EMAIL PROTECTED] 
Sent: December 14, 2004 11:45 AM
To: Lucene Users List
Subject: TFIDF Implementation

Hi,
My current task/problem is the following: I need to implement 
TFIDF document term ranking using Jakarta Lucene to compute a 
similarity rank between arbitrary documents in the constructed index.
I saw from the API that there are similar functions already 
implemented in the class Similarity and DefaultSimilarity but 
I don't know exactly how to use them. At the time my index 
has about 25000 (small) documents and there are about 75000 
terms stored in total.
Now, my question is simple. Does anybody has done this before 
or could point me to another location for help?

Thanks for any help in advance.
Christoph 

--
Christoph Kiefer
Department of Informatics, University of Zurich
Office: Uni Irchel 27-K-32
Phone:  +41 (0) 44 / 635 67 26
Email:  [EMAIL PROTECTED]
Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: LIMO problems

2004-12-13 Thread David Spencer
Daniel Cortes wrote:
Hi, I want to know: what library do you use to search PPT files?
I use this ("native code"):
http://chicago.sourceforge.net/xlhtml
Does POI support this?
thanks
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-12 Thread David Spencer
Chris Lamprecht wrote:
Very cool, thanks for posting this!  

Google's feature doesn't seem to do a search on every keystroke
necessarily.  Instead, it waits until you haven't typed a character
for a short period (I'm guessing about 100 or 150 milliseconds).  So
Thx again for the tip - I updated my experiment 
(http://www.searchmorph.com/kat/isearch.html, as per below) to use a 
150ms delay to avoid some needless searches...TBD is more intelligent 
"guidance" or suggestions for the user.

if you type fast, it doesn't hit the server until you pause.  There
are some more detailed postings on slashdot about how it works.
On Fri, 10 Dec 2004 16:36:27 -0800, David Spencer
<[EMAIL PROTECTED]> wrote:
Google just came out with a page that gives you feedback as to how many
pages will match your query and variations on it:
http://www.google.com/webhp?complete=1&hl=en
I had an unexposed experiment I had done with Lucene a few months ago
that this has inspired me to expose - it's not the same, but it's
similar in that as you type in a query you're given *immediate* feedback
as to how many pages match.
Try it here: http://www.searchmorph.com/kat/isearch.html
This is my "SearchMorph" site which has an index of ~90k pages of open
source javadoc packages.
As you type in a query, on every keystroke it does at least one Lucene
search to show results in the bottom part of the page.
It also gives spelling corrections (using my "NGramSpeller"
contribution) and also suggests popular tokens that start the same way
as your search query.
For one way to see corrections in action, type in "rollback" character
by character (don't do a cut and paste).
Note that:
-- this is not how the Google page works - just similar to it
-- I do single word suggestions while google does the more useful whole
phrase suggestions (TBD I'll try to copy them)
-- They do lots of javascript magic, whereas I use old school frames mostly
-- this is relatively expensive, as it does 1 query per character, and
when it's doing spelling correction there is even more work going on
-- this is just an experiment and the page may be unstable as I fool w/ it
What's nice is when you get used to immediate results, going back to the
"batch" way of searching seems backward, slow, and old fashioned.
There are too many idle CPUs in the world - this is one way to keep them
busier :)
-- Dave
PS Weblog entry updated too:
http://www.searchmorph.com/weblog/index.php?id=26
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-11 Thread David Spencer
Chris Lamprecht wrote:
Very cool, thanks for posting this!  

Google's feature doesn't seem to do a search on every keystroke
necessarily.  Instead, it waits until you haven't typed a character
for a short period (I'm guessing about 100 or 150 milliseconds). 
Ohh, good point - I was wondering how to "cancel" a URL - this is the 
right way. I'll try to code that in.

I also realized they're prob not doing searches at all - instead they're 
going off a DB of query popularity - I wanted to code up something 
generic, based just on term frequency but I don't think it'll be useful 
e.g. let's say
in my index (index of javadoc-generated documentation) the user types in
"hash" - well a human might guess that they intend "hash map" or 
"hashmap" or "hash tree" but I'm sure other terms are more frequent in 
my index than "map" and "tree"...I'm sure "hash java" occurs more 
frequently than "hash map" - or any other freq, non-stop word, and it's 
dubious that "hash java" is a useful suggestion...

So
if you type fast, it doesn't hit the server until you pause.  There
are some more detailed postings on slashdot about how it works.
On Fri, 10 Dec 2004 16:36:27 -0800, David Spencer
<[EMAIL PROTECTED]> wrote:
Google just came out with a page that gives you feedback as to how many
pages will match your query and variations on it:
http://www.google.com/webhp?complete=1&hl=en
I had an unexposed experiment I had done with Lucene a few months ago
that this has inspired me to expose - it's not the same, but it's
similar in that as you type in a query you're given *immediate* feedback
as to how many pages match.
Try it here: http://www.searchmorph.com/kat/isearch.html
This is my "SearchMorph" site which has an index of ~90k pages of open
source javadoc packages.
As you type in a query, on every keystroke it does at least one Lucene
search to show results in the bottom part of the page.
It also gives spelling corrections (using my "NGramSpeller"
contribution) and also suggests popular tokens that start the same way
as your search query.
For one way to see corrections in action, type in "rollback" character
by character (don't do a cut and paste).
Note that:
-- this is not how the Google page works - just similar to it
-- I do single word suggestions while google does the more useful whole
phrase suggestions (TBD I'll try to copy them)
-- They do lots of javascript magic, whereas I use old school frames mostly
-- this is relatively expensive, as it does 1 query per character, and
when it's doing spelling correction there is even more work going on
-- this is just an experiment and the page may be unstable as I fool w/ it
What's nice is when you get used to immediate results, going back to the
"batch" way of searching seems backward, slow, and old fashioned.
There are too many idle CPUs in the world - this is one way to keep them
busier :)
-- Dave
PS Weblog entry updated too:
http://www.searchmorph.com/weblog/index.php?id=26
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-10 Thread David Spencer
Google just came out with a page that gives you feedback as to how many 
pages will match your query and variations on it:

http://www.google.com/webhp?complete=1&hl=en
I had an unexposed experiment I had done with Lucene a few months ago 
that this has inspired me to expose - it's not the same, but it's 
similar in that as you type in a query you're given *immediate* feedback 
as to how many pages match.

Try it here: http://www.searchmorph.com/kat/isearch.html
This is my "SearchMorph" site which has an index of ~90k pages of open 
source javadoc packages.

As you type in a query, on every keystroke it does at least one Lucene 
search to show results in the bottom part of the page.

It also gives spelling corrections (using my "NGramSpeller" 
contribution) and also suggests popular tokens that start the same way 
as your search query.

For one way to see corrections in action, type in "rollback" character 
by character (don't do a cut and paste).

Note that:
-- this is not how the Google page works - just similar to it
-- I do single word suggestions while google does the more useful whole 
phrase suggestions (TBD I'll try to copy them)
-- They do lots of javascript magic, whereas I use old school frames mostly
-- this is relatively expensive, as it does 1 query per character, and 
when it's doing spelling correction there is even more work going on
-- this is just an experiment and the page may be unstable as I fool w/ it

What's nice is when you get used to immediate results, going back to the 
"batch" way of searching seems backward, slow, and old fashioned.

There are too many idle CPUs in the world - this is one way to keep them 
busier :)

-- Dave
PS Weblog entry updated too: 
http://www.searchmorph.com/weblog/index.php?id=26



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Single Digit Indexing

2004-12-06 Thread David Spencer
Otis Gospodnetic wrote:
Hm, if you can index 11, you should be able to index 8 as well.  In any
case, you most likely want to make sure that your Analyzer is not just
In theory you could have a  "length" filter tossing out tokens that are 
too short or too long, and maybe you're getting rid of all tokens less 
than 2 chars...
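
A quick sketch of such a length filter, in case it helps to check this - if an analyzer chains in something like the following with min=2, single digits like "8" never reach the index:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Drops tokens whose length is outside [min, max].
public class TokenLengthFilter extends TokenFilter
{
    private final int min, max;

    public TokenLengthFilter( TokenStream in, int min, int max)
    {
        super( in);
        this.min = min;
        this.max = max;
    }

    public Token next() throws IOException
    {
        for ( Token t = input.next(); t != null; t = input.next())
        {
            int len = t.termText().length();
            if ( len >= min && len <= max) return t;
            // otherwise skip the token and keep reading
        }
        return null;
    }
}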


throwing your numbers out.  This may stillbe up to date:
http://www.jguru.com/faq/view.jsp?EID=538308
See also: http://wiki.apache.org/jakarta-lucene/HowTo
Otis
--- "Bill von Ofenheim (LaRC)" <[EMAIL PROTECTED]> wrote:

How can I get Lucene to index single digits (e.g. "8" as in "Gemini
8")?
I am able to index numbers with two or more digits (e.g. "11" as in
"Apollo 11").
Thanks,
Bill von Ofenheim

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Looking for consulting help on project

2004-10-27 Thread David Spencer
Suggestions
[a]
Try invoking the VM w/ an option like "-XX:CompileThreshold=100" or even 
a smaller number. This encourages the hotspot VM to compile methods 
sooner, thus the app will take less time to "warm up".

http://java.sun.com/docs/hotspot/VMOptions.html#additional
You might want to search the web for refs to this, esp. how apps like 
Eclipse are brought up, as I think their invocation script sets other 
obscure options to guide GC too.

[b]
Any time I've worked w/ a hard core java server I've always found it 
helpful to have a loop explicitly trying to force gc - this is the idiom 
I use (i.e. you may have to do more than just System.gc()), and my 
suggestion is to try calling this every 15-60 secs so that memory use 
never jumps. I know that in theory you should never need to, but it may 
help.

// mem() is a helper returning bytes in use, e.g.
// Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory(),
// and sleep(ms) is Thread.sleep() with the InterruptedException swallowed.
public static long gc()
{
    long bef = mem();
    System.gc();
    sleep( 100);
    System.runFinalization();
    sleep( 100);
    System.gc();
    long aft = mem();
    return aft - bef; // how many bytes were freed
}
Gordon Riggs wrote:
Hi,
 
I am working on a web development project using PHP and mySQL. The team has
implemented full text search with mySQL, but is now researching Lucene to
help with performance/scalability issues. The team is looking for a
developer who has experience working with Lucene and can assist with
integrating into our environment. What follows is a brief overview of the
problems that we're working to address. If you have the experience with
using Lucene with large amounts of data (we have roughly 16 million records)
where search time is critical (needs to be under .2 seconds), then please
respond.
 
Thanks,
Gordon Riggs
[EMAIL PROTECTED]
 
1. Loading index into memory using Lucene's RAMDirectory
Why is the Java heap 2.9GB for a 1.4GB index?
Why can we not load an index over 1.4GB in size?  We receive
'java.lang.OutOfMemoryError' even with the -mx flag set to as high as '10g'.
We're using a dedicated test machine which has dual AMD Opteron processors
and 12GB of memory.  The OS is SuSE Linux Enterprise Server 9 (x86_64).  The
java version is: Java(TM) 2 Runtime Environment, Standard Edition (build
Blackdown-1.4.2) Java HotSpot(TM) 64-Bit Server VM (build
Blackdown-1.4.2-fcs, mixed mode)
We also get similar results with: Java(TM) 2 Runtime Environment, Standard
Edition (build 1.4.2_03-b02) Java HotSpot(TM) Client VM (build 1.4.2_03-b02,
mixed mode)

2. How to keep Lucene and Java in memory, to improve performance
The idea is to have a Lucene "daemon" that loads the index into memory once
on startup. It then listens for connections and performs search requests for
clients using that single index instance.
Do you foresee any problems (other than the ones stated above) with this
approach?
Garbage collection and/or memory leaks?  Performance issues?  
Concurrency issues with multiple searches coming in at once?
What's involved in writing the daemon?
Assuming that we need the daemon, we need to find out how big a job it is to
develop, what requirements need to be specified, etc.

3. How to interface our PHP web application with Java
Our web application is written in PHP so we need a communication interface
for performing search queries that is both PHP and Java friendly.
What do you think would be a good solution?  XML-RPC?
What's involved in developing the solution?
4. How to tune Lucene
Are there ways to "tune" Lucene in order to improve performance? We already
plan on moving the index into memory.
What else can be done to improve the search times? Can the way the index is
built affect performance?
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Thesaurus ...

2004-10-19 Thread David Spencer
Erik Hatcher wrote:
Have a look at the WordNet contribution in the Lucene sandbox 
repository.  It could be leveraged for part of a solution.
It's something I contributed.
Relevant links are:
http://jakarta.apache.org/lucene/docs/lucene-sandbox/
http://www.tropo.com/techno/java/lucene/wordnet.html
Basically it uses the Lucene index as a kind of associative array to map 
words to their synonyms using the thesaurus from Wordnet, so a key like, 
say, "fast" will have mappings to "quick" and "rapid". This can then be 
used for query expansion.

An example of this expansion in use is here:
http://www.hostmon.com/rfc/advanced.jsp

Erik
On Oct 19, 2004, at 12:40 PM, Patricio Galeas wrote:
Hello,
I'm a new user of Lucene, and I would like to use it to create a 
thesaurus.
Do you have any ideas on how to do this?  Thanks!

kind regards
P.Galeas


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Efficient search on lucene mailing archives

2004-10-14 Thread David Spencer
sam s wrote:
Hi Folks,
Is there any place where I can do a better search on the lucene mailing 
archives?
I tried JGuru and it looks like their search is paid.
The Apache-maintained archives lack efficient searching.
Of course one of the ironies is, shouldn't we be able to use Lucene to 
search the mailing list archives and even apache.org?
Thanks in advance,
s s
_
Don't just search. Find. Check out the new MSN Search! 
http://search.msn.com/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Highlighting PDF file after the search

2004-09-20 Thread David Spencer
[EMAIL PROTECTED] wrote:

Hello,
I can successfully index and search the PDF documents, however I am not
able to highlight the searched text in my original PDF file (i.e. like
dtSearch highlights on the original file).
I took a look at the highlighter in the sandbox, compiled it and have it
ready.  I am wondering if this highlighter is only for highlighting indexed
documents, or can it be used for PDF files as is?  Please enlighten!
I did this a few weeks ago.
There are two ways, and they both revolve around the same thing: you need 
the tokenized PDF text available.

[a] Store the tokenized PDF text in the index, or in some other file on 
disk i.e. a "cache" ( but cache is a misleading term, as you can't have 
a cache miss unless you can do [b]).

[b] Tokenize it on the fly when you call getBestFragments() - the 1st 
arg, the TokenStream, should be one that takes a PDF file as input and 
tokenizes it.

http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/highlighter/build/docs/api/org/apache/lucene/search/highlight/Highlighter.html#getBestFragments(org.apache.lucene.analysis.TokenStream,%20java.lang.String,%20int,%20java.lang.String)
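
For [b], a rough sketch, where extractPdfText() is a placeholder for whatever PDF text-extraction library you use (e.g. PDFBox) - not a real API:

// Highlight query terms in a PDF by tokenizing its extracted text on the fly.
String highlightPdf( File pdf, Query query, Analyzer analyzer) throws IOException
{
    String text = extractPdfText( pdf); // hypothetical helper: PDF -> plain text
    Highlighter highlighter = new Highlighter( new QueryScorer( query));
    TokenStream ts = analyzer.tokenStream( "contents", new StringReader( text));
    return highlighter.getBestFragments( ts, text, 5, "...");
}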
Thanks,
Vijay Balasubramanian
DPRA Inc.,
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-16 Thread David Spencer
Morus Walter wrote:
Hi David,
Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

great :-)
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
could you put the current version of your code on that website as a java
Weblog entry updated:
http://searchmorph.com/weblog/index.php?id=23
To link to source code:
http://www.searchmorph.com/pub/ngramspeller/NGramSpeller.java
source also? At least until it's in the lucene sandbox.
I created an ngram index on one of my indexes and think I found an issue
in the indexing code:
There is an option -f to specify the field on which the ngram index will
be created. 
However there is no code to restrict the term enumeration on this field.

So instead of

    final TermEnum te = r.terms();

I'd suggest

    final TermEnum te = r.terms(new Term(field, ""));

and a check within the loop over the terms whether the enumerated term
still has the field name 'field', e.g.

    Term t = te.term();
    if ( !t.field().equals(field) ) {
        break;
    }

otherwise you loop over all terms in all fields.
Great suggestion and thanks for that idiom - I should know such things 
by now. To clarify the "issue", it's just a performance one, not a 
functional one...anyway I put in the code - and to be scientific I 
benchmarked it twice before the change and twice after - and the 
results were surprisingly the same both times (1:45 to 1:50 with an index 
that takes up > 200MB). Probably there are cases where this will run 
faster, and the code seems more "correct" now, so it's in.



An interesting application of this might be an ngram-Index enhanced version
of the FuzzyQuery. While this introduces more complexity on the indexing
side, it might be a large speedup for fuzzy searches.
I'm also thinking of reviewing the list to see if anyone has done a "Jaro 
Winkler" fuzzy query yet, and doing that.



Thanks,
 Dave
Morus
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


IndexReader.close() semantics and optimize -- Re: problem with locks when updating the data of a previous stored document

2004-09-16 Thread David Spencer
Crump, Michael wrote:
You have to close the IndexReader after doing the delete, before opening the 
IndexWriter for the addition.  See information at this link:
http://wiki.apache.org/jakarta-lucene/UpdatingAnIndex
Recently I thought I observed that if I use this batch update idiom (1st 
delete the changed docs, then add them), it seems that 
IndexReader.close() does not flush/commit the deletions - rather 
IndexWriter.optimize() does.

I may have been confused and should retest this, but regardless, the 
javadoc seems unclear. close() says it "*saves* deletions to disk". What 
does it mean to save a deletion? Save a pending one, or commit it 
(commit -> really delete it)?

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#close()
Also optimize doesn't mention deletions.
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#optimize()
Suggestion: could the word "save" in the close() jdoc be elaborated on, 
and possibly could optimize() get another comment wrt its effect on 
deletions?

thx,
 Dave

Regards,
Michael
-Original Message-
From:   Paul Williams [mailto:[EMAIL PROTECTED]
Sent:   Thu 9/16/2004 5:39 AM
To: 'Lucene Users List'
Cc: 
Subject:problem with locks when updating the data of a previous stored document
Hi,
Using lucene-1.4.1.jar on WinXP  

I am having trouble with locking and updating an existing Lucene document. I
delete the old document from the index and then add the new document to the
index writer. I am using the minMerge docs set to 100 (much quicker!!) and
close the writer once the batch is done, so the documents are flushed to the
filesystem
The problem i am having is I can't delete the old version of the document
(after the first document has been added) using reader.delete because there
is a lock on the index due to the IndexWriter being open.
Am I doing this wrong or is there a simple way round this?
Regards,
Paul
Code snippets of the update code (I have just cut and pasted the relevant
line from my app to get an idea)
reader = IndexReader.open(location);
// Delete old doc/term if present
if (reader.docFreq(docNumberTerm) > 0) {
reader.delete(docNumberTerm);
.
.
.
IndexWriter writer = null;
// get the writer from the hash table so last few are cached and don't
have to be restarted
synchronized(IndexWriterCache) {
   String dbstring = "" + ldb;
   writer = (IndexWriter)IndexWriterCache.get(dbstring);
   if (writer == null) {
   //Not in cache so create one and add to cache for next time
   writer = new IndexWriter(location, new StandardAnalyzer(),
new_index);
   writer.setUseCompoundFile(true);
   // Set the maximum number of entries per field. Default is
10,000
   writer.maxFieldLength = MaxFieldCount;
   // Set how many docs will be stored in memory before being
saved to disk
   writer.minMergeDocs = (int) DocsInMemory;
   IndexWriterCache.remove(dbstring);
   IndexWriterCache.put(dbstring, writer);
}
.
.
.
  
// Add the docuents to the Lucene index
writer.addDocument(doc);


.
. Some time later after a batch of docs been added
 
	   writer.close();





Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Doug Cutting wrote:
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a 
term in the index, but the next 2 are. So it ignores the last 2 terms 
("recursive" and "descent") and suggests alternatives to 
"recursize"...thus if any term is in the index, regardless of 
frequency,  it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare 
terms in the query might be misspelled (i.e. not what the user 
intended) and we suggest alternatives to these words too (in addition 
to the words in the query that are not in the index at all).

Almost.
If the user enters "a recursize purser", then: "a", which is in, say, 
 >50% of the documents, is probably spelled correctly and "recursize", 
which is in zero documents, is probably mispelled.  But what about 
"purser"?  If we run the spell check algorithm on "purser" and generate 
"parser", should we show it to the user?  If "purser" occurs in 1% of 
documents and "parser" occurs in 5%, then we probably should, since 
"parser" is a more common word than "purser".  But if "parser" only 
occurs in 1% of the documents and purser occurs in 5%, then we probably 
shouldn't bother suggesting "parser".

If you wanted to get really fancy then you could check how frequently 
combinations of query terms occur, i.e., does "purser" or "parser" occur 
more frequently near "descent".  But that gets expensive.
I updated the code to have an optional popularity filter - if true then 
it only returns matches more popular (frequent) than the word that is 
passed in for spelling correction.

If true (default) then for common words like "remove", no results are 
returned now, as expected:

http://www.searchmorph.com/kat/spell.jsp?s=remove
But if you set it to false (bottom slot in the form at the bottom of the 
page) then the algorithm happily looks for alternatives:

http://www.searchmorph.com/kat/spell.jsp?s=remove&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=1.0&popular=0
TBD I need to update the javadoc & repost the code I guess. Also as per 
earlier post I also store simple transpositions for words in the 
ngram-index.
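
The popularity filter itself boils down to a document-frequency comparison against the original index - roughly this, assuming 'reader' is an IndexReader over the searched documents and 'field' is the indexed body field:

// Only offer 'suggestion' if it occurs in more documents than the word the user typed.
static boolean morePopular( IndexReader reader, String field,
                            String suggestion, String original) throws IOException
{
    return reader.docFreq( new Term( field, suggestion))
         > reader.docFreq( new Term( field, original));
}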

-- Dave
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
To restate the question for a second.
The misspelled word is: "conts".
The sugggestion expected is "const", which seems reasonable enough as 
it's just a transposition away, thus the string distance is low.

But - I guess the problem w/ the algorithm is that for short words 
like this, with transpositions, the two words won't share many ngrams.

Just looking at 3grams...
conts -> con ont nts
const -> con ons nst
So they share just one 3-gram, which is why it scores so low. This 
is an interesting issue: how to tune the algorithm so that it ranks 
words this close higher.

If you added 2-grams, then it would look like this (constructing also 
special start/end grams):
Oh cute trick to indicate prefixes and suffixes.
Anyway, as per my previous post I rebuilt the index w/ ngrams of length 2 to 5, 
and also store transpositions, and w/ appropriate boosts :) then "const" 
is returned 2nd.

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1
conts -> _c co on nt ts s_
const -> _c co on ns st t_
which gives 50% of overlap.
In another system that I designed we were using a combination of 2-4 
grams, albeit for a slightly different purpose, so in this case it would 
be:

conts:
_c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_
const:
_c co on ns st t_, _co con ons nst st_, _con cons onst nst_
and the overlap is 40%.
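For what it's worth, here is a rough sketch of building such letter n-grams with the '_' start/end markers used in the examples above; it is illustrative only, not the NGramSpeller code:

import java.util.ArrayList;
import java.util.List;

public class Grams {
    // All n-grams of the given length over the word padded with '_' at both ends.
    public static List ngrams(String word, int n) {
        String padded = "_" + word + "_";
        List out = new ArrayList();
        for (int i = 0; i + n <= padded.length(); i++)
            out.add(padded.substring(i, i + n));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("conts", 2)); // [_c, co, on, nt, ts, s_]
        System.out.println(ngrams("const", 2)); // [_c, co, on, ns, st, t_]
    }
}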

I guess one way is to add all simple transpositions to the lookup 
table (the "ngram index") so that these could easily be found, with 
the heuristic that "a frequent way of misspelling words is to 
transpose two adjacent letters".

Yes, sounds like a good idea. Even though it increases the size of the 
lookup index, it still beats using the linear search...


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Aad Nales wrote:
By trying: if you type const you will find that it returns 216 hits. The
third sports 'const' as a term (space separated and all). I would expect
'conts' to return 'const' as well. But again I might be mistaken. I
am now trying to figure out what the problem might be: 

1. my expectations (most likely ;-)
2. something in the code..

I enhanced the code to store simple transpositions also and I 
regenerated my site w/ ngrams from 2 to 5 chars. If you set the 
transposition boost up to 10 then "const" is returned 2nd...

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1
-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 15 September, 2004 12:23
To: Lucene Users List
Subject: Re: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene

Aad Nales wrote:

David,
Perhaps I misunderstand something, so please correct me if I do. I used

http://www.searchmorph.com/kat/spell.jsp to look for conts without 
changing any of the default values. What I got as results did not 
include 'const' which has quite a high frequency in your index and

??? how do you know that? Remember, this is an index of _Java_docs, and 
"const" is not a Java keyword.


should have a pretty low levenshtein distance. Any idea what causes 
this behavior?



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Andrzej Bialecki wrote:
Aad Nales wrote:
David,
Perhaps I misunderstand something, so please correct me if I do. I used
http://www.searchmorph.com/kat/spell.jsp to look for conts without
changing any of the default values. What I got as results did not
include 'const' which has quite a high frequency in your index and

??? how do you know that? Remember, this is an index of _Java_docs, and 
"const" is not a Java keyword.
I added a line of output to the right column under the 'details' box. 
"const" appears 216 times in the index (out of 96k docs), thus it is 
indeed kinda rare.

http://www.searchmorph.com/kat/spell.jsp?s=const

should have a pretty low levenshtein distance. Any idea what causes this
behavior?



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Aad Nales wrote:
By trying: if you type const you will find that it returns 216 hits. The
third sports 'const' as a term (space separated and all). I would expect
'conts' to return 'const' as well. But again I might be mistaken. I
am now trying to figure out what the problem might be: 

1. my expectations (most likely ;-)
2. something in the code..

Good question.
If I use the form at the bottom of the page and ask for more results, 
the suggestion of "const" does eventually show up - 99th however(!).

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=2.0&bend=1.0
Even boosting the prefix match from 2.0 to 10.0 only changes the ranking 
a few slots.
http://www.searchmorph.com/kat/spell.jsp?s=conts&min=3&max=4&maxd=5&maxr=1000&bstart=10.0&bend=1.0

To restate the question for a second.
The misspelled word is: "conts".
The suggestion expected is "const", which seems reasonable enough as 
it's just a transposition away, thus the string distance is low.

But - I guess the problem w/ the algorithm is that for short words like 
this, with transpositions, the two words won't share many ngrams.

Just looking at 3grams...
conts -> con ont nts
const -> con ons nst
So they share just one 3-gram, which is why it scores so low. This is 
an interesting issue: how to tune the algorithm so that it ranks 
words this close higher.

I guess one way is to add all simple transpositions to the lookup table 
(the "ngram index") so that these could easily be found, with the 
heuristic that "a frequent way of misspelling words is to transpose two 
adjacent letters".
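A quick sketch of what generating those simple transpositions might look like (only an illustration, not the code that was actually added to the ngram index builder):

import java.util.ArrayList;
import java.util.List;

public class Transpose {
    // "conts" -> [ocnts, cnots, cotns, const]: every swap of two adjacent letters.
    public static List transpositions(String word) {
        List out = new ArrayList();
        char[] c = word.toCharArray();
        for (int i = 0; i + 1 < c.length; i++) {
            char[] t = (char[]) c.clone();
            char tmp = t[i]; t[i] = t[i + 1]; t[i + 1] = tmp;
            out.add(new String(t));
        }
        return out;
    }
}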

Based on other mails I'll make some additions to the code and will 
report back if anything of interest changes here.



-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 15 September, 2004 12:23
To: Lucene Users List
Subject: Re: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene

Aad Nales wrote:

David,
Perhaps I misunderstand something, so please correct me if I do. I used

http://www.searchmorph.com/kat/spell.jsp to look for conts without 
changing any of the default values. What I got as results did not 
include 'const' which has quite a high frequency in your index and

??? how do you know that? Remember, this is an index of _Java_docs, and 
"const" is not a Java keyword.


should have a pretty low levenshtein distance. Any idea what causes 
this behavior?



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Doug Cutting wrote:
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a 
term in the index, but the next 2 are. So it ignores the last 2 terms 
("recursive" and "descent") and suggests alternatives to 
"recursize"...thus if any term is in the index, regardless of 
frequency,  it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare 
terms in the query might be misspelled (i.e. not what the user 
intended) and we suggest alternatives to these words too (in addition 
to the words in the query that are not in the index at all).

Almost.
If the user enters "a recursize purser", then: "a", which is in, say, 
 >50% of the documents, is probably spelled correctly and "recursize", 
which is in zero documents, is probably misspelled.  But what about 
"purser"?  If we run the spell check algorithm on "purser" and generate 
"parser", should we show it to the user?  If "purser" occurs in 1% of 
documents and "parser" occurs in 5%, then we probably should, since 
"parser" is a more common word than "purser".  But if "parser" only 
occurs in 1% of the documents and purser occurs in 5%, then we probably 
shouldn't bother suggesting "parser".
OK, sure, got it.
I'll give it a think and try to add this option to my just submitted 
spelling code.


If you wanted to get really fancy then you could check how frequently 
combinations of query terms occur, i.e., does "purser" or "parser" occur 
more frequently near "descent".  But that gets expensive.
Yeah, expensive for a large scale search engine, but probably 
appropriate for a desktop engine.

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
...or prepare in advance a fast lookup index - split all existing
terms to bi- or trigrams, create a separate lookup index, and then
simply for each term ask a phrase query (phrase = all n-grams from
an input term), with a slop > 0, to get similar existing terms.
This should be fast, and you could provide a "did you mean"
function too...
Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
phases. First you build a "fast lookup index" as mentioned above.
Then to correct a word you do a query in this index based on the
ngrams in the misspelled word.
The background for this suggestion was that I was playing some time ago 
with a Luke plugin that builds various sorts of ancillary indexes, but 
then I never finished it... Kudos for actually making it work ;-)
Sure, it was a fun little edge project. For the most part the code was 
done last week right after this thread appeared, but it always takes a 
while to get it from 95 to 100%.


[1] Source is attached and I'd like to contribute it to the sandbox,
esp if someone can validate that what it's doing is reasonable and
useful.

There have been many requests for this or similar functionality in the 
past, I believe it should go into sandbox.

I was wondering about the way you build the n-gram queries. You 
basically don't care about their position in the input term. Originally 
I thought about using PhraseQuery with a slop - however, after checking 
the source of PhraseQuery I realized that this probably wouldn't be that 
fast... You use BooleanQuery and start/end boosts instead, which may 
give similar results in the end but much cheaper.

I also wonder how this algorithm would behave for smaller values of 
Sure, I'll try to rebuild the demo w/ lengths 2-5 (and then the query 
page can test any contiguous combo).

start/end lengths (e.g. 2,3,4). In a sense, the smaller the n-gram 
length, the more "fuzziness" you introduce, which may or may not be 
desirable (increased recall at the cost of precision - for small indexes 
this may be useful from the user's perspective because you will always 
get a plausible hit, for huge indexes it's a loss).

[2] Here's a demo page. I built an ngram index for ngrams of length 3
 and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like
"recursixe" or whatnot to see what suggestions it returns. Note this
is not a normal search index query -- rather this is a test page for
spelling corrections.

http://www.searchmorph.com/kat/spell.jsp

Very nice demo! 
Thanks, kinda designed for ngram-nerds if you know what I mean :)
I bet it's running way faster than the linear search 
Indeed, this is almost zero time, whereas the simple and dumb linear 
search was taking me 10sec. I will have to redo the site's main search 
page so it uses this new code, TBD, prob tomorrow.

over terms :-), even though you have to build the index in advance. But 
if you work with static or mostly static indexes this doesn't matter.

Based on a subsequent mail in this thread I set boosts for the words
in the ngram index. The background is each word (er..term for a given
 field) in the orig index is a separate Document in the ngram index.
This Doc contains all ngrams (in my test case, like #2 above, of
length 3 and 4) of the word. I also set a boost of
log(word_freq)/log(num_docs) so that more frequent words will tend
to be suggested more often.

You may want to experiment with 2 <= n <= 5. Some n-gram based 
Yep, will do prob tomorrow.
techniques use all lengths together, some others use just single length, 
results also vary depending on the language...

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher

I think this is a reasonable heuristics. Reading the code I would 
present it this way:
ok, thx, will update
- words that share more ngrams with the orig word score higher, and
  words that share rare ngrams with the orig word score higher
  (as a natural consequence of using BooleanQuery),
- and, frequently occurring words score higher (as a consequence of using
  per-Document boosts),
- from reading the source code I see that you use Levenshtein distance
  to prune the resultset of too long/too short results,
I think also that because you don't use the positional information about 
 the input n-grams you may be getting some really weird hits.
Good point, though I haven't seen this yet. Might be due to the prefix 
boost and maybe some Markov chain magic tending to only show reasonable 
words.

You could 
prune them by simply checking if you find a (threshold) of input ngrams 
in the right sequence in the 

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Tate Avery wrote:
I get a NullPointerException shown (via Apache) when I try to access http://www.searchmorph.com/kat/spell.jsp
How embarrassing!
Sorry!
Fixed!


T
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 3:23 PM
To: Lucene Users List
Subject: NGramSpeller contribution -- Re: combining open office
spellchecker with Lucene
Andrzej Bialecki wrote:

David Spencer wrote:

I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate 
the Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.
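The brute-force version described there might look something like the sketch below -- the field name and structure are assumptions, not the actual code; note it walks every term in the index, which is why it is so slow:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import java.io.IOException;

public class BruteForceSpeller {
    public static String closestTerm(IndexReader reader, String field, String bad)
            throws IOException {
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        TermEnum terms = reader.terms();              // iterates *every* term - slow!
        try {
            while (terms.next()) {
                if (!field.equals(terms.term().field())) continue;
                String cand = terms.term().text();
                int d = levenshtein(bad, cand);
                if (d < bestDist) { bestDist = d; best = cand; }
            }
        } finally {
            terms.close();
        }
        return best;
    }

    // Standard dynamic-programming edit distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }
}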

...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop > 0, to get similar existing terms. This should be fast, and 
you could provide a "did you mean" function too...


Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like "recursixe" 
or whatnot to see what suggestions it returns. Note this is not a normal 
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp
[3] Here's the javadoc:
http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
[5] A few more details:
Based on a subsequent mail in this thread I set boosts for the words in 
the ngram index. The background is each word (er..term for a given 
field) in the orig index is a separate Document in the ngram index. This 
Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
that more frequent words will tend to be suggested more often.
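As a sketch, setting that per-word boost while building the ngram index might look roughly like this. The field names ("word", "gram3") are assumptions rather than the actual NGramSpeller schema, and it assumes the 1.4-era Document.setBoost():

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class NGramDoc {
    public static Document wordDoc(String word, String[] grams, int wordFreq, int numDocs) {
        Document doc = new Document();
        doc.add(Field.Keyword("word", word));        // the original term itself
        for (int i = 0; i < grams.length; i++)
            doc.add(Field.Text("gram3", grams[i]));  // one value per ngram
        // Frequent words get a higher index-time boost, so they surface more often.
        doc.setBoost((float) (Math.log(wordFreq) / Math.log(numDocs)));
        return doc;
    }
}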

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher

[6]
If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
  Dave

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate 
the Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.

...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop > 0, to get similar existing terms. This should be fast, and 
you could provide a "did you mean" function too...

Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 
phases. First you build a "fast lookup index" as mentioned above. Then 
to correct a word you do a query in this index based on the ngrams in 
the misspelled word.

Let's see.
[1] Source is attached and I'd like to contribute it to the sandbox, esp 
if someone can validate that what it's doing is reasonable and useful.

[2] Here's a demo page. I built an ngram index for ngrams of length 3 
and 4 based on the existing index I have of approx 100k 
javadoc-generated pages. You type in a misspelled word like "recursixe" 
or whatnot to see what suggestions it returns. Note this is not a normal 
search index query -- rather this is a test page for spelling corrections.

http://www.searchmorph.com/kat/spell.jsp
[3] Here's the javadoc:
http://www.searchmorph.com/pub/ngramspeller/org/apache/lucene/spell/NGramSpeller.html
[4] Here's source in HTML:
http://www.searchmorph.com/pub/ngramspeller/src-html/org/apache/lucene/spell/NGramSpeller.html#line.152
[5] A few more details:
Based on a subsequent mail in this thread I set boosts for the words in 
the ngram index. The background is each word (er..term for a given 
field) in the orig index is a separate Document in the ngram index. This 
Doc contains all ngrams (in my test case, like #2 above, of length 3 and 
4) of the word. I also set a boost of log(word_freq)/log(num_docs) so 
that more frequent words will tend to be suggested more often.

I think in "plain" English then the way a word is suggested as a 
spelling correction is:
- frequently occurring words score higher
- words that share more ngrams with the orig word score higher
- words that share rare ngrams with the orig word score higher

[6]
If people want to vote me in as a committer to the sandbox then I can 
check this code in - though again, I'd appreciate feedback.

thx,
 Dave

package org.apache.lucene.spell;

/* 
 * The Apache Software License, Version 1.1
 *
 * Copyright (c) 2001-2003 The Apache Software Foundation.  All rights
 * reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *notice, this list of conditions and the following disclaimer.
 *
 * 2. Redistributions in binary form must reproduce the above copyright
 *notice, this list of conditions and the following disclaimer in
 *the documentation and/or other materials provided with the
 *distribution.
 *
 * 3. The end-user documentation included with the redistribution,
 *if any, must include the following acknowledgment:
 *   "This product includes software developed by the
 *Apache Software Foundation (http://www.apache.org/)."
 *Alternately, this acknowledgment may appear in the software itself,
 *if and wherever such third-party acknowledgments normally appear.
 *
 * 4. The names "Apache" and "Apache Software Foundation" and
 *"Apache Lucene" must not be used to endorse or promote products
 *derived from this software without prior written permission. For
 *written permission, please contact [EMAIL PROTECTED]
 *
 * 5. Products derived from this software may not be called "Apache",
 *"Apache Lucene", nor may "Apache" appear in their name, without
 *prior written permission of the Apache Software Foundation.
 *
 * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONT

Re: PorterStemfilter

2004-09-14 Thread David Spencer
Honey George wrote:
Hi,
 This might be more of a question related to the
PorterStemmer algorithm than to Lucene, but
if anyone has the knowledge please share.
You might want to also try the Snowball stemmer:
http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/
And KStem:
http://ciir.cs.umass.edu/downloads/
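To double-check what the Porter filter itself produces, a quick standalone test along the lines described below might look roughly like this (old 1.x token API; the tokenizer choice is an assumption). The Porter algorithm only strips -er when the remaining stem is long enough, so 'printer' really does come through unchanged:

import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import java.io.StringReader;

public class StemCheck {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new PorterStemFilter(
            new LowerCaseTokenizer(new StringReader("printer print printing")));
        Token t;
        while ((t = ts.next()) != null)
            System.out.println(t.termText());   // printer, print, print
    }
}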
I am using the PorterStemFilter that comes with Lucene
and it turns out that searching for the word 'printer'
does not return a document containing the text
'print'. To narrow down the problem, I have tested the
PorterStemFilter in a standalone program and it turns
out that the stem of 'printer' is 'printer' and not
'print'. That is, 'printer' is not equal to 'print' +
'er'; the whole word is the stem. Can somebody
explain this behavior?
Thanks & Regards,
   George



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: OutOfMemory example

2004-09-13 Thread David Spencer
Daniel Naber wrote:
On Monday 13 September 2004 15:06, Jiří Kuhn wrote:

I think I can reproduce a memory leak problem while reopening
an index. The Lucene version tested is 1.4.1; version 1.4 final works OK. My
JVM is:

Could you try with the latest Lucene version from CVS? I cannot reproduce 
your problem with that version (Sun's Java 1.4.2_03, Linux).
I verified it w/ the latest lucene code from CVS under win xp.
Regards
 Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


SegmentReader - Re: FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
Another clue, the SegmentReaders are piling up too, which may be why the 
 Comparator map is increasing in size, because SegmentReaders are the 
keys to Comparator...though again, I don't know enough about the Lucene 
internals to know which refs to SegmentReaders are valid and which ones 
may be causing this leak.

David Spencer wrote:
David Spencer wrote:
Just noticed something else suspicious.
FieldSortedHitQueue has a field called Comparators and it seems like 
things are never removed from it

Replying to my own post...this could be the problem.
If I put in a print statement here in FieldSortedHitQueue, recompile, 
and run w/ the new jar then I see Comparators.size() go up after every 
iteration thru ReopenTest's loop and the size() never goes down...

 static Object store (IndexReader reader, String field, int type, Object 
factory, Object value) {
FieldCacheImpl.Entry entry = (factory != null)
  ? new FieldCacheImpl.Entry (field, factory)
  : new FieldCacheImpl.Entry (field, type);
synchronized (Comparators) {
  HashMap readerCache = (HashMap)Comparators.get(reader);
  if (readerCache == null) {
readerCache = new HashMap();
Comparators.put(reader,readerCache);
System.out.println( "*\t* NOW: "+ Comparators.size());
  }
  return readerCache.put (entry, value);
}
  }


Jiří Kuhn wrote:
This doesn't work either!
Let's concentrate on the first version of my code. I believe that the 
code should run endlessly (I have said it before: in version 1.4 final 
it does).

Jiri.
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:34 PM
To: Lucene Users List
Subject: force gc idiom - Re: OutOfMemory example
JiÅÃ Kuhn wrote:

Thanks for the bug id, it seems like my problem and I have 
stand-alone code with main().

What about a slow garbage collector? That looks like the wrong 
suggestion to me.


I've seen this written up before (javaworld?) as a way to probably 
"force" GC instead of just a System.gc() call. I think the 2nd gc() 
call is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();

Let's change the code once again:
...
   public static void main(String[] args) throws IOException, 
InterruptedException
   {
   Directory directory = create_index();

   for (int i = 1; i < 100; i++) {
   System.err.println("loop " + i + ", index version: " + 
IndexReader.getCurrentVersion(directory));
   search_index(directory);
   add_to_index(directory, i);
   System.gc();
   Thread.sleep(1000);// whatever value you want
   }
   }
...

and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
David Spencer wrote:
Just noticed something else suspicious.
FieldSortedHitQueue has a field called Comparators and it seems like 
things are never removed from it
Replying to my own post...this could be the problem.
If I put in a print statement here in FieldSortedHitQueue, recompile, 
and run w/ the new jar then I see Comparators.size() go up after every 
iteration thru ReopenTest's loop and the size() never goes down...

 static Object store (IndexReader reader, String field, int type, 
Object factory, Object value) {
FieldCacheImpl.Entry entry = (factory != null)
  ? new FieldCacheImpl.Entry (field, factory)
  : new FieldCacheImpl.Entry (field, type);
synchronized (Comparators) {
  HashMap readerCache = (HashMap)Comparators.get(reader);
  if (readerCache == null) {
readerCache = new HashMap();
Comparators.put(reader,readerCache);
		System.out.println( "*\t* NOW: "+ Comparators.size());
  }
  return readerCache.put (entry, value);
}
  }

Jiří Kuhn wrote:
This doesn't work either!
Let's concentrate on the first version of my code. I believe that the 
code should run endlessly (I have said it before: in version 1.4 final 
it does).

Jiri.
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:34 PM
To: Lucene Users List
Subject: force gc idiom - Re: OutOfMemory example
Jiří Kuhn wrote:

Thanks for the bug id, it seems like my problem and I have 
stand-alone code with main().

What about a slow garbage collector? That looks like the wrong 
suggestion to me.


I've seen this written up before (javaworld?) as a way to probably 
"force" GC instead of just a System.gc() call. I think the 2nd gc() 
call is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();

Let's change the code once again:
...
   public static void main(String[] args) throws IOException, 
InterruptedException
   {
   Directory directory = create_index();

   for (int i = 1; i < 100; i++) {
   System.err.println("loop " + i + ", index version: " + 
IndexReader.getCurrentVersion(directory));
   search_index(directory);
   add_to_index(directory, i);
   System.gc();
   Thread.sleep(1000);// whatever value you want
   }
   }
...

and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
Just noticed something else suspicious.
FieldSortedHitQueue has a field called Comparators and it seems like 
things are never removed from it

Jiří Kuhn wrote:
This doesn't work either!
Let's concentrate on the first version of my code. I believe that the code should run 
endlessly (I have said it before: in version 1.4 final it does).
Jiri.
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:34 PM
To: Lucene Users List
Subject: force gc idiom - Re: OutOfMemory example
Jiří Kuhn wrote:

Thanks for the bug id, it seems like my problem and I have stand-alone code with 
main().
What about a slow garbage collector? That looks like the wrong suggestion to me.

I've seen this written up before (javaworld?) as a way to probably 
"force" GC instead of just a System.gc() call. I think the 2nd gc() call 
is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();

Let's change the code once again:
...
   public static void main(String[] args) throws IOException, InterruptedException
   {
   Directory directory = create_index();
   for (int i = 1; i < 100; i++) {
   System.err.println("loop " + i + ", index version: " + 
IndexReader.getCurrentVersion(directory));
   search_index(directory);
   add_to_index(directory, i);
   System.gc();
   Thread.sleep(1000);// whatever value you want
   }
   }
...
and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


OptimizeIt -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
Jiří Kuhn wrote:
This doesn't work either!
You're right.
I'm running under JDK1.5 and trying larger values for -Xmx and it still 
fails.

Running under (Borland's) OptimizeIt shows the number of Terms and 
TermInfos (both in org.apache.lucene.index) increase every time thru the 
loop, by several hundred instances each.

I can trace thru some Term instances on the reference graph of 
OptimizeIt but it's unclear to me what's right. One *guess* is that 
maybe the WeakHashMap in either SegmentReader or FieldCacheImpl is the 
problem.



Let's concentrate on the first version of my code. I believe that the code should 
run endlessly (I have said it before: in version 1.4 final it does).
Jiri.
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:34 PM
To: Lucene Users List
Subject: force gc idiom - Re: OutOfMemory example
Jiří Kuhn wrote:

Thanks for the bug id, it seems like my problem and I have stand-alone code with 
main().
What about a slow garbage collector? That looks like the wrong suggestion to me.

I've seen this written up before (javaworld?) as a way to probably 
"force" GC instead of just a System.gc() call. I think the 2nd gc() call 
is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();

Let's change the code once again:
...
   public static void main(String[] args) throws IOException, InterruptedException
   {
   Directory directory = create_index();
   for (int i = 1; i < 100; i++) {
   System.err.println("loop " + i + ", index version: " + 
IndexReader.getCurrentVersion(directory));
   search_index(directory);
   add_to_index(directory, i);
   System.gc();
   Thread.sleep(1000);// whatever value you want
   }
   }
...
and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
Jiří Kuhn wrote:
Thanks for the bug id, it seems like my problem and I have stand-alone code with 
main().
What about a slow garbage collector? That looks like the wrong suggestion to me.

I've seen this written up before (javaworld?) as a way to probably 
"force" GC instead of just a System.gc() call. I think the 2nd gc() call 
is supposed to clean up junk from the runFinalization() call...

System.gc();
Thread.sleep( 100);
System.runFinalization();
Thread.sleep( 100);
System.gc();
Let's change the code once again:
...
public static void main(String[] args) throws IOException, InterruptedException
{
Directory directory = create_index();
for (int i = 1; i < 100; i++) {
System.err.println("loop " + i + ", index version: " + 
IndexReader.getCurrentVersion(directory));
search_index(directory);
add_to_index(directory, i);
System.gc();
Thread.sleep(1000);// whatever value you want
}
}
...
and in the 4th iteration java.lang.OutOfMemoryError appears again.
Jiri.
-Original Message-
From: John Moylan [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 4:53 PM
To: Lucene Users List
Subject: Re: OutOfMemory example
http://issues.apache.org/bugzilla/show_bug.cgi?id=30628
you can close the index, but the Garbage Collector still needs to 
reclaim the memory and it may be taking longer than your loop to do so.

John

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
Doug Cutting wrote:
David Spencer wrote:
Doug Cutting wrote:
And one should not try correction at all for terms which occur in a 
large proportion of the collection.

I keep thinking over this one and I don't understand it. If a user 
misspells a word and the "did you mean" spelling correction algorithm 
determines that a frequent term is a good suggestion, why not suggest 
it? The very fact that it's common could mean that it's more likely 
that the user wanted this word (well, the heuristic here is that users 
frequently search for frequent terms, which is probably wrong, but 
anyway..).

I think you misunderstood me.  What I meant to say was that if the term 
the user enters is very common then spell correction may be skipped. 
Very common words which are similar to the term the user entered should 
of course be shown.  But if the user's term is very common one need not 
even attempt to find similarly-spelled words.  Is that any better?
Yes, sure, thx, I understand now - but maybe not - the context I was 
something like this:

[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a term 
in the index, but the next 2 are. So it ignores the last 2 terms 
("recursive" and "descent") and suggests alternatives to 
"recursize"...thus if any term is in the index, regardless of frequency, 
 it is left as-is.

I guess you're saying that, if the user enters a term that appears in 
the index and thus is sort of spelled correctly ( as it exists in some 
doc), then we use the heuristic that any sufficiently large doc 
collection will have tons of misspellings, so we assume that rare terms 
in the query might be misspelled (i.e. not what the user intended) and 
we suggest alternatives to these words too (in addition to the words in 
the query that are not in the index at all).


Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
Doug Cutting wrote:
Aad Nales wrote:
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?" 

Note that a spell checker used with a search engine should use 
collection frequency information.  That's to say, only "corrections" 
which are more frequent in the collection than what the user entered 
should be displayed.  Frequency information can also be used when 
constructing the checker.  For example, one need never consider 
proposing terms that occur in very few documents.  

And one should not 
try correction at all for terms which occur in a large proportion of the 
collection.
I keep thinking over this one and I don't understand it. If a user 
misspells a word and the "did you mean" spelling correction algorithm 
determines that a frequent term is a good suggestion, why not suggest 
it? The very fact that it's common could mean that it's more likely that 
the user wanted this word (well, the heuristic here is that users 
frequently search for frequent terms, which is probably wrong, but 
anyway..).

I know in other contexts of IR frequent terms are penalized but in this 
context it seems that frequent terms should be fine...

-- Dave

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
eks dev wrote:
Hi Doug,

Perhaps.  Are folks really better at spelling the
beginning of words?

Yes they are. There were some comprehensive empirical
studies on this topic. Winkler modification on Jaro
string distance is based on this assumption (boosting
similarity if first n, I think 4, chars match).
Jaro-Winkler is well documented and some folks think
that it is much more efficient and precise than plain
Edit distance (of course for normal language, not
numbers or so).
I will try to dig-out some references from my disk on
Good ole Citeseer finds 2 docs that seem relevant:
http://citeseer.ist.psu.edu/cs?cs=1&q=Winkler+Jaro&submit=Documents&co=Citations&cm=50&cf=Any&ao=Citations&am=20&af=Any
I have some of the ngram spelling suggestion stuff, based on earlier 
msgs in this thread, working in my dev tree. I'll try to get a test site 
up later today for people to fool around with.


this topic, if you are interested.
On another note,
I would even suggest using Jaro-Winkler distance as
default for fuzzy query. (one could configure max
prefix required => prefix query to reduce number of
distance calculations). This could speed-up fuzzy
search dramatically.
Hope this was helpful,
Eks

  





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Doug Cutting wrote:
Aad Nales wrote:
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?" 

Note that a spell checker used with a search engine should use 
collection frequency information.  That's to say, only "corrections" 
which are more frequent in the collection than what the user entered 
should be displayed.  Frequency information can also be used when 
constructing the checker.  For example, one need never consider 
proposing terms that occur in very few documents.  And one should not 
try correction at all for terms which occur in a large proportion of the 
collection.
Good heuristics but are there any more precise, standard guidelines as 
to how to balance or combine what I think are the following possible 
criteria in suggesting a better choice:

- ignore(penalize?) terms that are rare
- ignore(penalize?) terms that are common
- terms that are closer (string distance) to the term entered are better
- terms that start w/ the same 'n' chars as the user's term are better
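Purely as an illustration, one naive way to fold those criteria into a single score -- every weight and cutoff below is a made-up assumption, not a standard or recommended formula:

public class SuggestionScore {
    // candDocFreq: docs containing the candidate; numDocs: total docs;
    // dist: edit distance to the user's word; prefixLen: shared leading chars.
    public static double score(int candDocFreq, int numDocs, int dist, int prefixLen) {
        double freq = (double) candDocFreq / numDocs;
        double freqTerm = (freq > 0.5 || candDocFreq < 2) ? 0.0 : freq; // drop very common and very rare terms
        double distTerm = 1.0 / (1 + dist);                             // closer strings score higher
        double prefixTerm = 0.1 * prefixLen;                            // reward matching leading chars
        return freqTerm + distTerm + prefixTerm;
    }
}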


Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Existing Parsers

2004-09-09 Thread David Spencer
Honey George wrote:
Hi,
  I know some of them.
1. PDF
 + http://www.pdfbox.org/
 + http://www.foolabs.com/xpdf/download.html
   - I am using this and found it good. It even supports 
My dated experience from 2 years ago was that (the evil, native code) 
foolabs pdf parser was the best, but obviously things could have changed.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg02912.html
 various languages.
2. word
  + http://sourceforge.net/projects/wvware
3. excel
  + http://www.jguru.com/faq/view.jsp?EID=1074230
-George
 --- [EMAIL PROTECTED] wrote: 

Anyone know of any reliable parsers out there for
pdf word 
excel or powerpoint?
For powerpoint it's not easy. I've been using this and it has worked 
fine until recently, and seems to sometimes go into an infinite loop now 
on some recent PPTs. Native code and a package that seems to be dormant 
but to some extent it does the job. The file "ppthtml" does the work.

http://chicago.sourceforge.net/xlhtml

-
To unsubscribe, e-mail:
[EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Andrzej Bialecki wrote:
David Spencer wrote:
I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate 
the Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.

...or prepare in advance a fast lookup index - split all existing terms 
to bi- or trigrams, create a separate lookup index, and then simply for 
each term ask a phrase query (phrase = all n-grams from an input term), 
with a slop > 0, to get similar existing terms. This should be fast, and 
you could provide a "did you mean" function too...
Sounds interesting/fun but I'm not sure if I'm following exactly.
Let's talk thru the trigram index case.
Are you saying that for every trigram in every word there will be a 
mapping of trigram -> term?
Thus if "recursive" is in the (orig) index then we'd create entries like:

rec -> recursive
ecu -> ...
cur -> ...
urs -> ...
rsi -> ...
siv -> ...
ive -> ...
And so on for all terms in the orig index.
OK fine.
But now the user types in a query like "recursivz".
What's the algorithm - obviously I guess take all trigrams in the bad 
term and go thru the trigram-index, but there will be lots of 
suggestions. Now what - use string distance to score them? I guess that 
makes sense - plz confirm if I understand. And so I guess the point 
here is we precalculate the trigram->term mappings to avoid an expensive 
traversal of all terms in an index, but we still use string distance as 
a 2nd pass (and prob should force the matches to always match on the 1st 
n (3) chars, using the heuristic that people can usually start 
spelling a word correctly).
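Putting that two-pass idea into a sketch: candidates are assumed to come back from the trigram lookup index, then get re-ranked by edit distance, keeping only those that share the first 3 characters with the user's word. Everything here is illustrative, not actual code:

import java.util.Iterator;
import java.util.List;

public class SecondPass {
    // Re-rank trigram-index candidates: require the first 3 chars to match,
    // then keep the candidate with the smallest edit distance.
    public static String pickBest(String bad, List candidates) {
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        String prefix = bad.length() >= 3 ? bad.substring(0, 3) : bad;
        for (Iterator it = candidates.iterator(); it.hasNext();) {
            String cand = (String) it.next();
            if (!cand.startsWith(prefix)) continue;
            int d = levenshtein(bad, cand);
            if (d < bestDist) { bestDist = d; best = cand; }
        }
        return best;
    }

    // Standard dynamic-programming edit distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }
}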




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Aad Nales wrote:
Hi All,
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create an Analyzer
based on the spellchecker of OpenOffice. My question is: "has anybody
tried this before?" 
I did a WordNet/synonym query expander. Search for "WordNet" on this 
page. Of interest: it stores the WordNet info in a separate Lucene 
index, since at its essence an index is just a database.

http://jakarta.apache.org/lucene/docs/lucene-sandbox/
Also, another variation is to instead spell-check based on what terms are in 
the index, not what an external dictionary says. I've done this on my 
experimental site searchmorph.com in a dumb/inefficient way. Here's an 
example:

http://www.searchmorph.com/kat/search.jsp?s=recursivz
After you click above it takes ~10sec as it produces terms close to 
"recursivz". Oops - looking at the output, it looks like the same word 
is suggested multiple times - ouch - I must be considering all fields, not 
just the contents field. TBD is fixing this (or no wonder it's so slow :)).

I can/should send the code out. The logic is that for any terms in a 
query that have zero matches, go thru all the terms(!) and calculate the 
Levenshtein string distance, and return the best matches. A more 
intelligent way of doing this is to instead look for terms that also 
match on the 1st "n" (prob 3) chars.



Cheers,
Aad
--
Aad Nales
[EMAIL PROTECTED], +31-(0)6 54 207 340 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Full web search engine package using Lucene

2004-09-08 Thread David Spencer
Anne Y. Zhang wrote:
Hi, I am assisting a professor with an IR course.
We need to provide the students with a fully-functional
search engine package, and the professor prefers it
being powered by Lucene. Since I am new to Lucene,
can anyone tell me where I can get such a
package? We also want the package to
contain a crawling function. Thank you very much!
http://www.nutch.org/
Ya

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


IndexSearcher.close() and aborting searches in progress

2004-09-08 Thread David Spencer
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#close()
What is the intent of IndexSearcher.close()?
I want to know how, in a web app, one can stop a search that's in 
progress - the use case is a user is limited to one search at a time, and 
when one (expensive) search is running they decide it's taking too long 
so they elaborate on the query and resubmit it. Goal is for the server 
to stop the search that's in progress and to start a new one. I know how 
to deal w/ session vars and so on in a web container - but can one stop 
a search that's in progress and is that the intent of close()?

I haven't done the obvious experiment but regardless, the javadoc is 
kinda terse so I wanted to hear from the all knowing people on the list.

thx,
  Dave
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: about search sorting

2004-09-03 Thread David Spencer
Wermus Fernando wrote:
 
Luceners,
My app is creating, updating and deleting from the index, and searching
too. I need some information about sorting by a field. Could anyone
send me a link related to sorting?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Sort.html
 
Thanks in advance.
 
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Search Result

2004-07-02 Thread David Spencer
Hetan Shah wrote:
My search results only display the top portion of the indexed 
documents, even though the query matches text in the later part of the document. 
Where should I look to change the code in demo3 of the default 1.3 final 
distribution? In general, if I want to show the block of the document that 
matches the query string, which classes should I use?
Sounds like this:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#DEFAULT_MAX_FIELD_LENGTH
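If that is the cause, a sketch of the fix -- assuming the 1.3/1.4-era public maxFieldLength field on IndexWriter (later releases expose a setter instead), with a placeholder path and analyzer:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class IndexWholeDocs {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        writer.maxFieldLength = 1000000;  // default is 10,000 tokens per field, so long docs get truncated
        // ... add documents here ...
        writer.close();
    }
}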
Thanks guys.
-H
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Running OutOfMemory while optimizing and searching

2004-07-02 Thread David Spencer
This in theory should not help, but anyway, just in case, the idea is to 
call gc() periodically to "force" gc - this is the code I use which 
tries to force it...

public static long gc()
{
long bef = mem();
System.gc();
sleep( 100);
System.runFinalization();
sleep( 100);
System.gc();
long aft= mem();
return aft-bef;
}
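The mem() and sleep() helpers aren't shown in that snippet; plausible stand-ins (assumptions only) would be:

public static long mem()
{
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();   // bytes of heap currently in use
}

public static void sleep(long millis)
{
    try { Thread.sleep(millis); }
    catch (InterruptedException ignored) { }
}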
Mark Florence wrote:
Thanks, Jim. I'm pretty sure I'm throwing OOM for real,
and not because I've run out of file handles. I can easily
recreate the latter condition, and it is always reported
accurately. I've also monitored the OOM as it occurs using
"top" and I can see memory usage climbing until it is
exhausted -- if you will excuse the pun!
I'm not familiar with the new compound file format. Where
can I look to find more information?
-- Mark
-Original Message-
From: James Dunn [mailto:[EMAIL PROTECTED]
Sent: Friday, July 02, 2004 01:29 pm
To: Lucene Users List
Subject: Re: Running OutOfMemory while optimizing and searching
Ah yes, I don't think I made that clear enough.  From
Mark's original post, I believe he mentioned that he
used separate readers for each simultaneous query.
His other issue was that he was getting an OOM during
an optimize, even when he set the JVM heap to 2GB.  He
said his index was about 10.5GB spread over ~7000
files on Linux.  

My guess is that OOM might actually be a "too many
open files" error.  I have seen that type of error
being reported by the JVM as an OutOfMemory error on
Linux before.  I had the same problem but once I
switched to the new Lucene compound file format, I
haven't had that problem since.  

Mark, have you tried switching to the compound file
format?  

Jim

--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> What do your queries look like?  The memory required for a query can be
> computed by the following equation:
>
> 1 Byte * Number of fields in your query * Number of docs in your index
>
> So if your query searches on all 50 fields of your 3.5 million document
> index then each search would take about 175MB.  If your 3-4 searches run
> concurrently then that's about 525MB to 700MB chewed up at once.
That's not quite right.  If you use the same
IndexSearcher (or 
IndexReader) for all of the searches, then only
175MB are used.  The 
arrays in question (the norms) are read-only and can
be shared by all 
searches.

In general, the amount of memory required is:

1 byte * number of searchable fields in your index * number of docs in your index
plus
1k bytes * number of terms in query
plus
1k bytes * number of phrase terms in query

The latter are for i/o buffers. There are a few other things, but these are the major ones.
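Plugging in the numbers from earlier in the thread (50 searchable fields, 3.5M docs, a 3-term query) as a quick back-of-the-envelope check -- illustrative only:

public class MemEstimate {
    public static void main(String[] args) {
        long fields = 50, docs = 3500000L, queryTerms = 3, phraseTerms = 0;
        long bytes = fields * docs            // 1 byte per searchable field per doc (norms)
                   + queryTerms * 1024        // ~1k per query term (i/o buffers)
                   + phraseTerms * 1024;      // ~1k per phrase term
        System.out.println(bytes / 1000000 + " MB");   // prints 175 MB, shared across searches
    }
}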

Doug

-
To unsubscribe, e-mail:
[EMAIL PROTECTED]
For additional commands, e-mail:
[EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Visualization of Lucene search results with a treemap

2004-07-01 Thread David Spencer
Stefan Groschupf wrote:
Dave,
cool stuff, think aboout to contribute that to nutch.. ;-)!
Well the code is very generic - basically 1 method that takes a 
Searcher, a Query, the # of cells to show, and the size of the diagram. 
Technically I think it would be a Lucene sandbox contribution - but - 
for my site I do want to convert the custom spider/cache to use Nutch...

Do you know:
http://websom.hut.fi/websom/comp.ai.neural-nets-new/html/root.html ?
Interesting - is there any code avail to draw the maps?
thx,
 Dave
Cheers,
Stefan
Am 01.07.2004 um 23:28 schrieb David Spencer:
Inspired by these guys who put results from Google into a treemap...
http://google.hivegroup.com/
I did up my own version running against my index of OSS/javadoc trees.
This query for "thread pool" shows it off nicely:
http://www.searchmorph.com/kat/tsearch.jsp?s=thread%20pool&side=300&goal=500

This is the empty search form:
http://www.searchmorph.com/kat/tsearch.jsp
And the weblog entry has a few more links, esp useful if you don't  
know what a treemap is:

http://searchmorph.com/weblog/index.php?id=18
Oh: As a start, a treemap is a visualization technique, not  
java.util.Treemap. Bigger boxes show a higher score, and x,y location  
has no significance.

Enjoy,
  Dave

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: search multiple indexes

2004-07-01 Thread David Spencer
Stefan Groschupf wrote:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/MultiSearcher.html

100% Right.
I personally found code samples more interesting than just javadoc.
Good point.
That's why my hint; here's the code snippet from Nutch:
But - warning - in normal use of Lucene you don't need the Similarity 
stuff..
/** Construct given a number of indexed segments. */
public IndexSearcher(File[] segmentDirs) throws IOException {
  NutchSimilarity sim = new NutchSimilarity();
  Searchable[] searchables = new Searchable[segmentDirs.length];
  segmentNames = new String[segmentDirs.length];
  for (int i = 0; i < segmentDirs.length; i++) {
    org.apache.lucene.search.Searcher searcher =
      new org.apache.lucene.search.IndexSearcher(new File(segmentDirs[i], "index").toString());
    searcher.setSimilarity(sim);
    searchables[i] = searcher;
    segmentNames[i] = segmentDirs[i].getName();
  }
  this.luceneSearcher = new MultiSearcher(searchables);
  this.luceneSearcher.setSimilarity(sim);
}
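For plain Lucene without the Similarity stuff, a rough sketch might look
like this (index paths and field names below are just placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        // two hypothetical indexes searched as one
        Searchable[] searchables = new Searchable[] {
            new IndexSearcher("/indexes/docs"),
            new IndexSearcher("/indexes/mail")
        };
        MultiSearcher searcher = new MultiSearcher(searchables);

        Query query = QueryParser.parse("thread pool", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.score(i) + "\t" + hits.doc(i).get("title"));
        }
        searcher.close();
    }
}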
Kent Beck said: "Monkey see, Monkey do." ;-)
Cheers,
Stefan
---
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: search multiple indexes

2004-07-01 Thread David Spencer
Stefan Groschupf wrote:
Possibly a silly question - but how would I go about searching multiple
indexes using lucene?  Do I need to basically repeat the code I use to
search one index for each one, or is there a better way to do it?

Take a look at the nutch.org sourcecode. It does what you are searching 
for.
Isn't the answer closer to home?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/MultiSearcher.html
HTH
Stefan
---
enterprise information technology consulting
open technology:   http://www.media-style.com
open source:   http://www.weta-group.net
open discussion:http://www.text-mining.org
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Visualization of Lucene search results with a treemap

2004-07-01 Thread David Spencer
Inspired by these guys who put results from Google into a treemap...
http://google.hivegroup.com/
I did up my own version running against my index of OSS/javadoc trees.
This query for "thread pool" shows it off nicely:
http://www.searchmorph.com/kat/tsearch.jsp?s=thread%20pool&side=300&goal=500
This is the empty search form:
http://www.searchmorph.com/kat/tsearch.jsp
And the weblog entry has a few more links, esp useful if you don't know 
what a treemap is:

http://searchmorph.com/weblog/index.php?id=18
Oh: As a start, a treemap is a visualization technique, not 
java.util.TreeMap. Bigger boxes show a higher score, and x,y location 
has no significance.

Enjoy,
  Dave

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Making a case for Lucene

2004-06-30 Thread David Spencer
Alex McManus wrote:
Hi,
 
we are at the initial design stages of a public-facing web-based search
application for a U.S. Federal Agency. We have proposed a clustered Lucene
architecture as the best technical solution, as we feel their current system
(based on Oracle) won't give the best performance, and introduces a lot of
unnecessary complexity and expense (as the system is read-only). We also
feel that the Lucene design will be very flexible, easier to maintain and
administer.
 
Government agencies are notoriously conservative when it comes to decisions
about technology, especially when open-source is involved. Perhaps
surprisingly, their response has been encouraging. However, they want
further re-assurance that other big-name organizations have successfully
used Lucene for large datasets.
 
First some background: we will be searching a number of repositories, the
largest of which includes about 600,000 documents, and might reach 10
million over the next 10 years. The documents are probably comparable to web
pages in terms of average size, and would be indexed under about 10
different fields. Our plan is to partition the indexes and distribute them
over a number of modern Intel/Linux servers.
 
From what I pick up on the mailing lists, this seems well within the
capabilities of Lucene. I've looked at the Powered-by Lucene pages, but
there are two problems: (i) there are no details on the size of the datasets
being searched; (ii) I don't think our customer would recognize any of these
organizations.
 
In Otis' OnJava article, he lists "FedEx, Overture, Mayo Clinic, Hewlett
Packard, New Scientist magazine, Epiphany, and others using, or at least
evaluating, Lucene". This is more like it(!), but I want to be honest and
open with our customer, and the "or at least evaluating" comment is not
concrete enough, and there is no idea of scale.

The best example that I've been able to find is the Yahoo research lab - as
I understand it, this is a Nutch (i.e. Lucene) implementation that's
providing impressive performance over a 100 million document repository.
I would be very grateful if anyone could pass on some basic details of
successful large-scale Lucene projects, and even more so if they involve a
"big name" or government agency. If you are happy to pass this information
on, but would prefer to keep it off the public mailing list, then please
email me directly - I will respect confidentiality.
I think that this problem of re-assuring customers/managers is a common one,
so I would be happy to collate any responses to this as a new Wiki entry.
Hopefully one day (with their permission) we will be able to add our
customer to the Powered-by Lucene page too.
Are you interested in the opposing view? Lucene is essentially a 
"library", not an application. You have to create a whole app around it 
including admin interfaces, training classes, etc. Something like Verity 
has the whole infrastructure - front end, admin i/f, training classes, 
consultants, machine sizing guides. The cost for a commercial solution 
seems easier to scope - how much does Verity (or whoever) charge? With 
some Lucene/OSS solution there's development that would need to be done 
and all the stuff that makes a complete, comprehensive app...

Of course if cost is a big issue then trying some Lucene++ solution 
makes sense. And don't get me wrong...I'm a big fan of Lucene, and when 
I was CTO at Lumos Technologies, we were cost sensitive, and I took care 
of making sure we used Lucene on the intranet and our public sites and 
were very happy w/ it. I'm just trying to make sure the, um, the, or 
one, "business viewpoint" is statedcommercial solutions already 
exist and are complete and they work and there's no development to do, 
so there's less risk... Where I'm at now I brought it up to the VP/MIS 
however it just didn't make sense for the way that dept was run, thus 
we're starting to use Verity.

- Dave

 
Thanks in advance,
 
Alex McManus ([EMAIL PROTECTED])

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


ANN: Experimental site for searching javadoc of OSS projects

2004-06-25 Thread David Spencer
I've put together a kind of experimental site which indexes the javadoc 
of OSS java projects (well, plus the JDK).

http://www.searchmorph.com/
This is meant to solve the problem where a java developer knows 
something has been done before, but where, in what project - source 
forge? jakarta? eclipse? jboss?

There are at least 2 somewhat unique things here. I use a custom 
analyzer ("JavadocAnalyzer") which I recently mentioned on this list in 
another context. With it searches for something like "thread pool" will 
match tokens like "SyncThreadPool" or "Sync_ThreadPool".

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1731360
There's also an AIM (AOL) IM bot running. You send it a query and it 
sends you back 5 URLs of matches - web search w/o a browser.

Also inside - it does query expansion so that query terms are checked 
against multiple fields (may be similar to what nutch does).

And I also use the MoreLikeThis query expansion code I wrote - from a 
results page you can find similar URLs to the hits you see. [BTW: this 
doesn't seem to have made it into the sandbox...]

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1353138
The about page is here:
http://www.searchmorph.com/weblog/index.php?id=7
And the "technology inside" page elaborates a bit more:
http://www.searchmorph.com/weblog/index.php?id=3
I'm interested in feedback. Does it find matches you expect, and what 
other packages should I index?

thx,
Dave
PS
Surely this has been done before - what's the "competition" - any other 
similar specialized search engines?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Various kind of queries

2004-06-24 Thread David Spencer
Hetan Shah wrote:
Hello,
You guys have been great! I read lots of threads and am learning a lot 
about Lucene.

Can anyone point me in the right direction or show me a code sample where 
I can build queries for 'any word', 'all words' and 'phrase'. I tried 
to look at the Lucene FAQ but I did not understand how to go about it. 
Any code sample would be really helpful. I am guessing that I would 
have to use PhraseQuery and BooleanQuery and QueryParser classes. Not 
sure about the syntax.

You mean build them in code, right? Thus you won't need QueryParser. Here's 
code I posted a while ago:

http://java2.5341.com/msg/51792.html
Note: is this the only archive of lucene-user? I thought it used to be 
somewhere else...
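In the same spirit as that post, here's a rough sketch (not the exact code
from the link) of building the three query styles programmatically; it
assumes the words have already been analyzed/lowercased the same way they
were at index time, and the field name is whatever you indexed under:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryBuilders {

    /** "any word": OR the terms together (no clause required or prohibited). */
    public static Query anyWord(String field, String[] words) {
        BooleanQuery q = new BooleanQuery();
        for (int i = 0; i < words.length; i++) {
            q.add(new TermQuery(new Term(field, words[i])), false, false);
        }
        return q;
    }

    /** "all words": AND the terms together (every clause required). */
    public static Query allWords(String field, String[] words) {
        BooleanQuery q = new BooleanQuery();
        for (int i = 0; i < words.length; i++) {
            q.add(new TermQuery(new Term(field, words[i])), true, false);
        }
        return q;
    }

    /** "phrase": the terms must appear next to each other, in order. */
    public static Query phrase(String field, String[] words) {
        PhraseQuery q = new PhraseQuery();
        for (int i = 0; i < words.length; i++) {
            q.add(new Term(field, words[i]));
        }
        return q;
    }
}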

Thanks.
-H
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: carrot2 - Re: Categorization

2004-06-23 Thread David Spencer
William W wrote:
Hi,
Carrot seems to be very interesting but I didn't find a simple example :(
I will try to use it ! :)
I can't find an example either, but after going through their source I 
think the heart of it is

com.dawidweiss.carrot.filter.stc.algorithm.STCEngine
and  com.dawidweiss.carrot.filter.stc.Processor is a class that drives this.
Lucene hook - hey - I'm trying to integrate the two. I think this is how 
it would be done: get search results from Lucene, then set up STCEngine 
the way Processor does.


Thx,
william.

From: David Spencer <[EMAIL PROTECTED]>
Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
To: Lucene Users List <[EMAIL PROTECTED]>
Subject: carrot2 - Re: Categorization
Date: Wed, 23 Jun 2004 11:50:22 -0700
Otis Gospodnetic wrote:
Hello William,
Lucene does not have a categorization engine, but you may want to look
at Carrot2 (http://sourceforge.net/projects/carrot2/)

May be getting off topic - but maybe not... I can't find an example of 
how to use Carrot2. It builds easily enough, but there's no obvious 
example of what it takes as input ("documents"?) and what it returns as 
output (some list of clustered docs?). I want to use the "local 
interface" to it and hook it into Lucene.

thx,
 Dave

Otis
--- William W <[EMAIL PROTECTED]> wrote:
Hi,
How can I do a categorization of the results ? Is it possible with
the Lucene API ?
Thanks,
William.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


carrot2 - Re: Categorization

2004-06-23 Thread David Spencer
Otis Gospodnetic wrote:
Hello William,
Lucene does not have a categorization engine, but you may want to look
at Carrot2 (http://sourceforge.net/projects/carrot2/)
May be getting off topic - but maybe not... I can't find an example of how 
to use Carrot2. It builds easily enough, but there's no obvious example of 
what it takes as input ("documents"?) and what it returns as output 
(some list of clustered docs?). I want to use the "local interface" to 
it and hook it into Lucene.

thx,
 Dave

Otis
--- William W <[EMAIL PROTECTED]> wrote:
Hi,
How can I do a categorization of the results ? Is it possible with
the 
Lucene API ?
Thanks,
William.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Fix for "advanced tokenizers and highlighter" problem

2004-06-22 Thread David Spencer
[EMAIL PROTECTED] wrote:
I think this version of the highlighter should provide a fix: http://www.inperspective.com/lucene/hilite2beta.zip
Before I update the version of the highlighter in the sandbox I'd appreciate feedback from those troubled 
with the issues to do with overlapping tokens in token streams (Erik, Dave, Bruce?)
1st pass of testing - yes, this does indeed fix the problem.
I've realized I may want to modify my Analyzer now too.
I was focusing on the Token position increment instead of the offset.
For something like the case where I broke "HashMap" into 3 tokens: 
"Hash", "Map", "HashMap", I was returning the same start/end offsets for 
 all of them (thus a search on "Map" ends up with all of "HashMap" 
being highlighted). Probably more correct is to return offsets within 
the orig larger token so that you can see exactly where your term 
matched. I'll update my code and then put up a site that demonstrates this.
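Here's a tiny sketch of what "offsets within the orig larger token" means;
the numbers assume "HashMap" starts at character start of the original
text, and this isn't the actual analyzer code (the ordering of the full
token vs. the subtokens is a detail, the offsets are the point):

import java.util.LinkedList;
import java.util.List;
import org.apache.lucene.analysis.Token;

public class SubtokenOffsets {
    /** Emit "hashmap" plus overlapping "hash"/"map" subtokens whose offsets
     *  point inside the original word, so a hit on "map" highlights just
     *  the "Map" part of "HashMap". */
    public static List subtokens(int start) {
        List tokens = new LinkedList();

        tokens.add(new Token("hashmap", start, start + 7)); // increment 1 (default)

        Token hash = new Token("hash", start, start + 4);   // the "Hash" part
        hash.setPositionIncrement(0);                        // same position as "hashmap"
        tokens.add(hash);

        Token map = new Token("map", start + 4, start + 7); // the "Map" part
        map.setPositionIncrement(0);
        tokens.add(map);

        return tokens;
    }
}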

thx,
 Dave

I added my own test analyzer to the Junit test that introduces synonyms into the 
token stream at the same
position as the trigger token and the new code works OK for me with that analyzer.
The fix means I needed to change the Formatter interface - this now takes a "TokenGroup" object instead 
of a token because that can be used to represent a single token OR a sequence of overlapping tokens.
I don't think most people have needed to create custom Formatter implementations so I don't think this
redefined interface should break too much existing code (if any).

Cheers
Mark
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
Erik Hatcher wrote:
On Jun 19, 2004, at 2:29 AM, David Spencer wrote:
A naive analyzer would turn something like "SyncThreadPool" into one 
token. Mine uses the great Lucene capability of Tokens being able to 
have a "0" position increment to turn it into the token stream:

Sync   (incr = 0)
Thread (incr = 0)
Pool (incr = 0)
SyncThreadPool (incr = 1)
[As an aside maybe it should also pair up the subtokens, so 
"SyncThread" and "ThreadPool" appear too].

The point behind this is someone searching for "threadpool" probably 
would want to see a match for "SyncThreadPool" even though this is the evil 
leading-prefix case. With most other Analyzers and ways of forming a 
query this would be missed, which I think is anti-human and annoys me 
to no end.

There are indexing/querying solutions/workarounds to the 
leading-prefix issue, such as reversing the text as you index it and 
ensuring you do the same on queries so they match.  There are some 
interesting techniques for this type of thing in the Managing 
Gigabytes book I'm currently reading, which Lucene could support with 
custom analysis and queries, I believe.
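[Aside: the reversing trick can be as small as a TokenFilter feeding a 
separate field at index time, so a leading wildcard like *pool becomes an 
ordinary prefix query loop* against the reversed field. A rough sketch, 
not anything from the book:]

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class ReverseFilter extends TokenFilter {
    public ReverseFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        // reverse the term text, keep the original offsets/type/increment
        String reversed = new StringBuffer(t.termText()).reverse().toString();
        Token r = new Token(reversed, t.startOffset(), t.endOffset(), t.type());
        r.setPositionIncrement(t.getPositionIncrement());
        return r;
    }
}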
Yeah, great book. I thought my approach fit into Lucene the most 
naturally for my goals - and no doubt, things like just having the 
possibility of different pos increments is a great concept that I 
haven't seen in other search engines. I keep meaning to try an idea that 
appeared on the list some months ago, bumping up the incr between 
sentences so that it's harder for, say, a 2 word phrase to match w/ 1 
word in each sentence (makes sense to a computer, but usually not what a 
human wants).  Another side project...


The problem is as follows. In all cases I use my Analyzer to index 
the documents.
If I use my Analyzer with the Highlighter package, it doesn't 
look at the position increment of the tokens and consequently a 
nonsense stream of matches is output. If I use a different Analyzer 
w/ the highlighter (say, the StandardAnalyzer), then it doesn't show 
the matches that really matched, as it doesn't see the "subtokens".

Are your "subtokens" marked with correct offset values?  This probably 
doesn't relate to the problem you're seeing, but I'm curious.

I think so but this is the first time I've done this kind of thing. When 
I hit the special case several of the "subtokens" are 1st returned w/ an 
incr of 0, then the normal token, w/ an incr of 1 - which seems to make 
sense to me at least.


It might be that the fix is for the Highlighter to look at the position 
increment of tokens and only pass by one if multiple ones have an 
incr of 0 and match one part of the query.

Has this come up before and is the issue clear?

The problem is clear, and I've identified this issue with my 
exploration of the Highlighter also.  The Highlighter works well for 
the most common scenarios, but certainly doesn't cover all the bases.  
The majority of scenarios do not use multiple tokens in a single 
position.  Also, it also doesn't currently handle the new SpanQuery 
family - although Highlighting spans would be quite cool.  After 
learning how Highlighter works, I have a deep appreciation for the 
great work Mark put into it - it is well done.

As for this issue, though, I think your solution sounds reasonable, 
although I haven't thought it through completely.  Perhaps Mark can 
comment.  If you do modify it to work for your case, it
would be great to have your contribution rolled back in :)
Oh sure, I'll post any changes but wait for Mark for now.
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
[EMAIL PROTECTED] wrote:
Yes, this issue has come up before with other choices of analyzers.
I think it should be fixable without changing any of the highlighter APIs 
- can you email me or post here the source to your analyzer?
 

Code attached - don't make fun of it please :) - very prelim. I think it 
only uses one other file (TRQueue), also attached (but note: it's in a 
different package). Also any comments in the code may be inaccurate. The 
general goal is as stated in my earlier mail, examples are:

AlphaBeta ->
Alpha (incr 0)
Beta (incr 0)
AlphaBeta (incr 1)
MAX_INT ->
MAX (incr 0)
INT (incr 0)
MAX_INT (incr 1)
thx,
Dave
Cheers
Mark
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 


package com.tropo.lucene;

import org.apache.lucene.analysis.*;
import java.io.*;
import java.util.*;
import com.tropo.util.*;
import java.util.regex.*;

/**
 * Try to parse javadoc better than other analyzers.
 */
public final class JavadocAnalyzer
    extends Analyzer
{
    // [A-Za-z0-9._]+
    //
    public final TokenStream tokenStream( String fieldName, Reader reader)
    {
        return new LowerCaseFilter( new JStream( fieldName, reader));
    }

    /**
     * Try to break up a token into subset/subtokens that might be said to occur
     * in the same place.
     */
    public static List breakup( String s)
    {
        // "a" -> null
        // "alphaBeta" -> "alpha", "Beta"
        // "XXAlpha" -> ?, Alpha
        // BIG_NUM -> "BIG", "NUM"

        List lis = new LinkedList();

        Matcher m;

        m = breakupPattern.matcher( s);
        while (m.find())
        {
            String g = m.group();
            if ( ! g.equals( s))
                lis.add( g);
        }

        // hard ones
        m = breakupPattern2.matcher( s);
        while (m.find())
        {
            String g;
            if ( m.groupCount() == 2) // weird XXFoo case
                g = m.group( 2);
            else
                g = m.group();
            if ( ! g.equals( s))
                lis.add( g);
            /*
            o.println( "gc: " + m.groupCount() +
                       "/" + m.group( 0) + "/" + m.group( 1) + "/" + m.group( 2));
            */
            //lis.add( m.group());
        }
        return lis;
    }

    /**
     *
     */
    private static class JStream
        extends TokenStream
    {
        private TRQueue q = new TRQueue();
        private Set already = new HashSet();
        private String fieldName;
        private PushbackReader pb;

        private StringBuffer sb = new StringBuffer( 32);
        private int offset;

        // eat white
        // have
        private int state = 0;

        /**
         *
         */
        private JStream( String fieldName, Reader reader)
        {
            this.fieldName = fieldName;
            pb = new PushbackReader( reader);
        }

        /**
         *
         */
        public Token next()
            throws IOException
        {
            if ( q.size() > 0) // pre-calculated
                return (Token) q.dequeue();
            int c;
            int start = offset;
            sb.setLength( 0);
            offset--;
            boolean done = false;
            String type = "mystery";
            state = 0;

            while ( ! done &&
                    ( c = pb.read()) != -1)
            {
                char ch = (char) c;
                offset++;
                switch ( state)
                {
                case 0:
                    if ( Character.isJavaIdentifierStart( ch))
                    {
                        start = offset;
                        sb.append( ch);
                        state = 1;
                        type = "id";
                    }
                    else if ( Character.isDigit( ch))

amusing interaction between advanced tokenizers and highlighter package

2004-06-18 Thread David Spencer
I've run across an amusing interaction between advanced 
Analyzers/TokenStreams and the very useful "term highlighter": 
http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/contributions/highlighter/

I have a custom Analyzer I'm using to index javadoc-generated web pages.
The Analyzer in turn has a custom TokenStream which tries to more 
intelligently tokenize java-language tokens.

A naive analyzer would turn something like "SyncThreadPool" into one 
token. Mine uses the great Lucene capability of Tokens being able to 
have a "0" position increment to turn it into the token stream:

Sync   (incr = 0)
Thread (incr = 0)
Pool (incr = 0)
SyncThreadPool (incr = 1)
[As an aside maybe it should also pair up the subtokens, so "SyncThread" 
and "ThreadPool" appear too].

The point behind this is someone searching for "threadpool" probably 
would want to see a match for "SyncThreadPool" even though this is the evil 
leading-prefix case. With most other Analyzers and ways of forming a 
query this would be missed, which I think is anti-human and annoys me to 
no end.

So the analyzer/tokenizer works great, and I have a demo site about to 
come up that indexes lots of publicly avail javadoc as a kind of 
resource so you can easily find what's already been done.

The problem is as follows. In all cases I use my Analyzer to index the 
documents.
If I use my Analyzer with the Highlighter package, it doesn't look 
at the position increment of the tokens and consequently a nonsense 
stream of matches is output. If I use a different Analyzer w/ the 
highlighter (say, the StandardAnalyzer), then it doesn't show the 
matches that really matched, as it doesn't see the "subtokens".

It might be that the fix is for the Highlighter to look at the position 
increment of tokens and only pass by one if multiple ones have an incr 
of 0 and match one part of the query.

Has this come up before and is the issue clear?
thx,
Dave
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


extensible query parser - Re: Proximity Searches behavior

2004-06-09 Thread David Spencer
Erik Hatcher wrote:
On Jun 9, 2004, at 12:21 PM, David Spencer wrote:
show us that most folks query with 1 - 3 words and do not use 
any of the advanced features.

But with automagic query expansion these things might be done behind 
the scenes.  Nutch, for one, expands simple queries to check against 
multiple fields, with different boosts, and even gives a bonus for 
terms that are near each other.

Ah yes!  Don't worry, I hadn't forgotten about Nutch.  I'm tinkering 
with its query parsing and analysis as we speak in fact.  Very clever 
indeed.

The elegance of the query syntax is quite important, and QueryParser 
has gotten a bit hairy.  I would enjoy discussions on creating new 
query parsers (one size doesn't fit all, I don't think) and what syntax

I suggested in some email a while ago making the QueryParser 
extensible at runtime or startup time, so you can add other types of 
queries that it doesn't support - so you have a way of registering 
these other query types (SpanQuery, SubstringQuery etc) and then some 
syntax like "span:foo" to invoke the query expander registered w/ 
"span" on "foo"...

I would be curious to see how an implementation of this played out.  
For example, could I add my own syntax such that

"some phrase" <-3-> "another phrase"
could be parsed into a SpanNearQuery of two SpanNearQuery's?
I like the idea of a flexible run-time grammar, but it sounds too good 
to be true in a general purpose kinda way.
My idea isn't perfect for humans, but at least lets you use query types 
that aren't hard coded.

You have something like
[1] how you register, could be in existing QueryParser
void register( String name, SubQueryParser qp)
[2] what you register
interface SubQueryParser
{
Query parse( String s); // parses string user enters, forms a Query...
}
[3] example of registration
register( "substring", new SubstringQP());  // instead of prefix matches 
allows term anywhere
register( "span", new SurroundQP());
register( "syn", new SynonymExpanderQP()); // expands a word to include 
synonyms

[4]  syntax
normal query parser syntax but add something else like "NAME::TEXT" 
(note 2 colons) so

this:  "black syn::bird"
expands to calls in the new extensible query parser,  something like
BooleanQuery bq = ...
bq.add( new TermQuery( new Term( "contents", "black")))
bq.add( synonymQP.parse( "bird"))  // synonymQP = the SubQueryParser registered as "syn"
return bq
behind the scenes SynonymExpanderQP expanded "bird" to the query 
equivalent of, um, "bird avian^.5 wingedanimal^.5" or whatnot.

[5] the point
Be backward  compatible and "natural" for existing query syntax, but 
leave a hook so that if you innovate and define new query expansion code 
there's some hope of someone using it as they can in theory drop it in 
and use it w/o coding. Right now if you create some code in this area I 
suspect there's little chance people will try it out as there's too much 
friction to try it out.
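
To make that less abstract, here's a throwaway sketch of the registry idea.
None of these classes exist in Lucene, and the parsing is deliberately
crude (whitespace split, no quoting or boolean operators):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ExtensibleQueryParser {

    /** What you register: turns the user's text into a Query. */
    public interface SubQueryParser {
        Query parse(String text);
    }

    private final Map parsers = new HashMap(); // name -> SubQueryParser
    private final String field;

    public ExtensibleQueryParser(String field) {
        this.field = field;
    }

    public void register(String name, SubQueryParser parser) {
        parsers.put(name, parser);
    }

    /** Crude parse: whitespace split; "name::text" words dispatch to the
     *  registered parser, everything else becomes an optional TermQuery. */
    public Query parse(String input) {
        BooleanQuery bq = new BooleanQuery();
        String[] words = input.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            String w = words[i];
            int sep = w.indexOf("::");
            SubQueryParser sub =
                (sep > 0) ? (SubQueryParser) parsers.get(w.substring(0, sep)) : null;
            if (sub != null) {
                bq.add(sub.parse(w.substring(sep + 2)), false, false);
            } else {
                bq.add(new TermQuery(new Term(field, w.toLowerCase())), false, false);
            }
        }
        return bq;
    }
}

With that, register( "syn", new SynonymExpanderQP()) and a query like
black syn::bird would expand the way [4] above describes.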








Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Proximity Searches behavior

2004-06-09 Thread David Spencer
Erik Hatcher wrote:
On Jun 9, 2004, at 8:53 AM, Terry Steichen wrote:
3) Is there a plan for adding QueryParser support for the SpanQuery 
family?

Another important facet to Terry's question here is what syntax to use 
to express all various types of queries?  I suspect that Google stats
And others - Altavista logs I think...
show us that most folks query with 1 - 3 words and do not use any 
of the advanced features.

But with automagic query expansion these things might be done behind the 
scenes.  Nutch, for one, expands simple queries to check against 
multiple fields, with different boosts, and even gives a bonus for terms 
that are near each other.

The elegance of the query syntax is quite important, and QueryParser 
has gotten a bit hairy.  I would enjoy discussions on creating new 
query parsers (one size doesn't fit all, I don't think) and what syntax

I suggested in some email a while ago making the QueryParser extensible 
at runtime or startup time, so you can add other types of queries that 
it doesn't support - so you have a way of registering these other query 
types (SpanQuery, SubstringQuery etc) and then some syntax like 
"span:foo" to invoke the query expander registered w/ "span" on "foo"...

should be used.
Paul Elschot created a "surround" query parser that he posted about to 
the list in April.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Setting Similarity in IndexWriter and IndexSearcher

2004-06-07 Thread David Spencer
Does it ever make sense to set the Similarity obj in either (only one 
of..) IndexWriter or IndexSearcher? i.e. If I set it in IndexWriter can 
I avoid setting it in IndexSearcher? Also, can I avoid setting it in 
IndexWriter and only set it in IndexSearcher? I noticed Nutch sets it in 
both places and was wondering about what's going on behind the scenes...
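
For context, "setting it in both places" amounts to something like this
sketch; the anonymous Similarity is just a toy that turns off length
normalization, and the index path is made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Similarity;

public class SimilarityInBothPlaces {
    public static void main(String[] args) throws Exception {
        Similarity sim = new DefaultSimilarity() {
            public float lengthNorm(String fieldName, int numTerms) {
                return 1.0f; // ignore field length
            }
        };

        // index time: lengthNorm() is baked into the norms written to the index
        IndexWriter writer = new IndexWriter("/tmp/demo-index", new StandardAnalyzer(), true);
        writer.setSimilarity(sim);
        // ... writer.addDocument(...) calls here ...
        writer.close();

        // search time: the searcher's Similarity drives the scoring
        IndexSearcher searcher = new IndexSearcher("/tmp/demo-index");
        searcher.setSimilarity(sim);
        // ... searcher.search(query) here ...
        searcher.close();
    }
}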

thx,
 Dave
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


"No tvx reader"

2004-06-05 Thread David Spencer
Using 1.4rc3.
Running an app that indexes 50k documents (thus it just uses an 
IndexWriter).
One field has that boolean set for it to have a term vector stored for 
it, while the other 11 fields don't.

On stdout I see "No tvx file" 13 times.
Glancing thru the src it seems this comes from TermVectorReader.
The generated index seems fine.
What could be causing this and is this normal?
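For reference, the one field with the term vector was created roughly like
this (field names here are made up; the last boolean is the term-vector flag):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class TermVectorDoc {
    public static Document build(String title, String text) {
        Document doc = new Document();
        // stored, indexed, tokenized, and with a term vector
        doc.add(new Field("contents", text, true, true, true, true));
        // the other fields omit the term vector (last argument false)
        doc.add(new Field("title", title, true, true, true, false));
        return doc;
    }
}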
thx,
Dave
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


bonus for exact case match

2004-06-03 Thread David Spencer
Does anyone have any experiences with giving a bonus for exactly 
matching case in queries?

One use case is in the java world: maybe I want to see references to 
"Map" (java.util.Map)  but am not interested in a (geographical) "map".

I believe, in the context of Lucene, one way is to have an Analyzer that 
returns a TokenStream which, in cases where a word has some upper case 
characters, returns the word twice in that position, once as-is and once 
in lower case,  using the magic of Token.getPositionIncrement(). Then 
you'll need a query expander or whatnot which, when given a query like 
"Map", expands it to "Map^2 map".

Thoughts/comments?
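
Here's a rough sketch of the query-expansion half; the field name and class
are made up, and it assumes the index side stored both the original-case and
lowercased forms at the same position as described above:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CaseBonusExpander {
    public static Query expand(String field, String word) {
        String lower = word.toLowerCase();
        if (lower.equals(word)) {
            // already all lower case: nothing to boost
            return new TermQuery(new Term(field, lower));
        }
        BooleanQuery bq = new BooleanQuery();
        TermQuery exact = new TermQuery(new Term(field, word));
        exact.setBoost(2.0f);                                         // the "Map^2" part
        bq.add(exact, false, false);                                  // optional clause
        bq.add(new TermQuery(new Term(field, lower)), false, false);  // the "map" part
        return bq;
    }
}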
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

