Term Weights and Clustering

2005-02-23 Thread Owen Densmore
I'm building a TDM (Term Document Matrix) from my Lucene index.  As 
part of this, it would be useful to have the document term weights (the 
TF*IDF weights) if they are already available.  Naturally I can compute 
them, but I suspect they are lurking behind an API I've not discovered 
yet.  Is there an API for getting them?
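
(There doesn't seem to be a one-call getter for this in the 1.4 API, but 
if the index stores term vectors the computation is only a few lines.  A 
minimal sketch, assuming a term-vector-enabled field named "contents" and 
an index directory "index" -- both names hypothetical:)

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;

public class TermWeights {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("index");
    TermFreqVector tfv = reader.getTermFreqVector(0, "contents");
    String[] terms = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();
    int numDocs = reader.numDocs();
    for (int i = 0; i < terms.length; i++) {
      int df = reader.docFreq(new Term("contents", terms[i]));
      // Mirrors Lucene's default Similarity: sqrt(tf) * (log(N/(df+1)) + 1).
      double weight = Math.sqrt(freqs[i])
                    * (Math.log((double) numDocs / (df + 1)) + 1.0);
      System.out.println(terms[i] + "\t" + weight);
    }
    reader.close();
  }
}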

I'm doing this as a first step in discovering a good set of clustering 
labels.  My data collection is 1200 research papers, all of which have 
good metadata: titles, authors, abstracts, keyphrases and so on.

One source for how to do this is the thesis of Stanislaw Osinski and 
others like it:
http://www.dcs.shef.ac.uk/teaching/eproj/msc2004/abs/m3so.htm
And the Carrot2 project, which uses similar techniques:
http://www.cs.put.poznan.pl/dweiss/carrot/

My problem is simple: I need a fairly clear discussion of exactly how 
to generate the labels and how to assign documents to them.  The thesis is 
quite good, but I'm not sure I can reduce it to practice in the 2-3 
days I have to evaluate it!  Lucene has made the TDM easy to calculate, 
but I basically don't know what to do next!

Can anyone comment on whether or not this will work, and if so, suggest 
a quick way to get a demo on the air?  For example, I don't seem to be 
able to ask Carrot2 to do a Google "site" search.  If I could, I could 
simply aim Carrot2 at my collection with a very general search and see 
what clusters it discovers.  This may be a gross misuse of Carrot2's 
clustering anyway, so could easily be a blind alley.

Or is there a different stunt with Lucene that might work?  For 
example, use Lucene to cluster the docs using a batch search where the 
queries are Library of Congress descriptions!  Batch searching is 
*really fast* in Lucene -- I've been able to search the data collection 
against each distinct keyphrase in seconds!
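
(For the record, that batch-search stunt is easy to sketch: run each 
candidate label as a quoted phrase query and tag every matching document 
with its best-scoring label.  The field name, index path, and label list 
below are all assumptions:)

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class BatchLabeler {
  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher("index");
    String[] labels = { "finite state automata", "conditional entropy" };
    Map bestLabel = new HashMap();  // docId -> label
    Map bestScore = new HashMap();  // docId -> Float
    for (int i = 0; i < labels.length; i++) {
      // Quotes force the label's terms to match as a phrase.
      Query q = QueryParser.parse("\"" + labels[i] + "\"", "Keywords",
                                  new StandardAnalyzer());
      Hits hits = searcher.search(q);
      for (int j = 0; j < hits.length(); j++) {
        Integer id = new Integer(hits.id(j));
        Float prev = (Float) bestScore.get(id);
        if (prev == null || hits.score(j) > prev.floatValue()) {
          bestScore.put(id, new Float(hits.score(j)));
          bestLabel.put(id, labels[i]);
        }
      }
    }
    System.out.println(bestLabel);
    searcher.close();
  }
}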

Owen


Mail Archive Broken?

2005-02-19 Thread Owen Densmore
I just beamed into the archive:
http://mail-archives.apache.org/eyebrowse/SummarizeList?listId=30
...and it only has messages through Feb 1!
What's up?
Owen


Re: Multiple Keywords/Keyphrases fields

2005-02-15 Thread Owen Densmore
From: Erik Hatcher <[EMAIL PROTECTED]>
Date: February 12, 2005 3:09:15 PM MST
To: "Lucene Users List" 
Subject: Re: Multiple Keywords/Keyphrases fields
The real question to answer is what types of queries you're planning 
on making.  Rather than look at it from indexing forward, consider it 
from searching backwards.

How will users query using those keyword phrases?
Hi Erik.  Good point.
There are two uses we are making of the keyphrases:
	- Graphical Navigation: A Flash graphical browser will allow users to 
fly around in a space of documents, choosing what to view: Authors, 
Keyphrases and Textual terms.  In any of these cases, the "closeness" of 
any of the fields will govern how close they will appear graphically.  
In the case of authors, we will weight collaboration .. how often the 
authors work together.  In the case of Keyphrases, we will want to use 
something like the distance vectors you show in the book, using the 
cosine measure.  Thus the keyphrases need to be separate entities within 
the document .. it would be a bug for us if the terms leaked across the 
separate keyphrases within the document.

	- Textual Search: In this case, we will have two ways to search the 
keyphrases.  The first would be like the graphical navigation above 
where searching for "complex system" should require the terms to be in 
a single keyphrase.  The second way will be looser, where we may simply 
pool the keyphrases with the titles and abstracts, and allow them all to be 
searched together within the document.

Does this make sense?  So the question from the search standpoint is: 
do multiple instances of a field act like there are barriers across the 
instances, or are they somehow treated as a single instance?  In terms 
of the closeness calculation, for example, can we get separate term 
vectors for each instance of the keyphrase field, or will we get a 
single vector combining all the keyphrase terms within a single 
document?
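
(One way to answer that empirically is a quick probe -- index two 
instances of the field with term vectors turned on and see what comes 
back.  My understanding is that Lucene keeps one term vector per field 
name per document, so the instances should come back merged; this sketch 
assumes the 1.4 Field.Text(name, value, storeTermVector) overload:)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.RAMDirectory;

public class VectorProbe {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(Field.Text("Keywords", "finite state automata", true));
    doc.add(Field.Text("Keywords", "markov chains", true));
    writer.addDocument(doc);
    writer.close();

    IndexReader reader = IndexReader.open(dir);
    // Expect a single merged vector holding terms from both instances.
    TermFreqVector tfv = reader.getTermFreqVector(0, "Keywords");
    String[] terms = tfv.getTerms();
    for (int i = 0; i < terms.length; i++)
      System.out.println(terms[i]);
    reader.close();
  }
}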

I hope this is clear!  Kinda hard to articulate.
Owen
    Erik
On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote:
I'm getting a bit more serious about the final form of our Lucene 
index.  Each document has DocNumber, Authors, Title, Abstract, and 
Keywords.  By Keywords, I mean a comma-separated list, each entry 
having possibly many terms in a phrase like:
	temporal infomax, finite state automata, Markov chains,
	conditional entropy, neural information processing

I presume I should be using a field "Keywords" which has many 
"entries" or "instances" per document (one per comma-separated 
phrase).  But I'm not sure of the right way to handle all this.  My 
assumption is that I should analyze them individually, just as we do 
for free text (the Abstract, for example), thus in the example above 
having 5 entries of the nature
	doc.add(Field.Text("Keywords", "finite state automata"));
etc, analyzing them because these are author-supplied strings with no 
canonical form.
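
(The splitting itself is trivial.  A sketch, assuming the raw 
comma-separated Keywords string is already in hand:)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class KeyphraseFields {
  // Add one analyzed "Keywords" instance per comma-separated phrase.
  static void addKeyphrases(Document doc, String keywords) {
    String[] phrases = keywords.split("\\s*,\\s*");
    for (int i = 0; i < phrases.length; i++) {
      doc.add(Field.Text("Keywords", phrases[i].trim()));
    }
  }
}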

For guidance, I looked in the archive and found the attached email, 
but I didn't see the answer.  (I'm not concerned about the dups, I 
presume that is equivalent to a boost of some sort.)  Does this seem 
right?

Thanks once again.
Owen
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Subject: Multiple equal Fields?
Date: Tue, 17 Feb 2004 12:47:58 +0100
Hi!
What happens if I do this:
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "blah"));
Is there a field "foo" with value "blah", or are there two "foo"s 
(actually not possible), or is there one "foo" with the values "bar" 
and "blah"?

And what does happen in this case:
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "bar"));
doc.add(Field.Text("foo", "bar"));
Does lucene store this only once?
Timo





Re: Lucene Unicode Usage

2005-02-11 Thread Owen Densmore
Bingo!  I used the InputStreamReader and that fixed the index.  Boy, 
it's tough to catch all the holes through which unicode leaks!

Owen
From: aurora <[EMAIL PROTECTED]>
Date: February 9, 2005 11:04:35 PM MST
To: lucene-user@jakarta.apache.org
Subject: Re: Lucene Unicode Usage
So you got a UTF-8 encoded text file.  But how do you read the file into 
Java?  The default encoding of Java is likely to be something other than 
UTF-8.  Make sure you specify the encoding, like:

  Reader reader = new InputStreamReader(new FileInputStream(filename), "UTF-8");
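
(Spelled out, the whole read path might look like this -- a sketch, with 
the filename as a placeholder:)

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class Utf8Read {
  public static void main(String[] args) throws Exception {
    // Decode as UTF-8 explicitly, independent of the platform default.
    BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream("dump.txt"), "UTF-8"));
    for (String line = in.readLine(); line != null; line = in.readLine()) {
      System.out.println(line);  // hand each line to the indexer instead
    }
    in.close();
  }
}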
--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
From: Andrzej Bialecki <[EMAIL PROTECTED]>
Date: February 10, 2005 2:54:56 AM MST
To: Lucene Users List 
Subject: Re: Lucene Unicode Usage
Owen Densmore wrote:
> I'm building an index from a FileMaker database by dumping the data to
> a tab-separated file.  Because the FileMaker output is encoded in
> MacRoman, and uses Mac line separators, I run a script across the tab
> file to clean it up:
>     tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
> This basically converts the Mac \r's to \n's, replaces FileMaker's
> vtabs (for inter-field CRs) with blanks, and runs a character
> converter to build utf-8 data for Java to use.  Looks fine in jEdit
> and BBEdit, both of which understand UTF.
However, it matters how you have read in the files in your Java 
application. Did you use InputStreamReader with the default platform 
encoding (probably 8859-1), or did you specify UTF-8 explicitly?

> BUT -- when I look at the indexes created in Lucene using Luke, I get
> unprintable letters!  Writing programs to dump the terms (using Writer
By default Luke uses the standard platform-specific font "dialog". On 
Windows this font doesn't support Unicode glyphs, so you will see just 
blanks (or rectangles). In the upcoming release you will be able to 
select the display font.

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Lucene Unicode Usage

2005-02-09 Thread Owen Densmore
I'm building an index from a FileMaker database by dumping the data to 
a tab-separated file.  Because the FileMaker output is encoded in 
MacRoman, and uses Mac line separators, I run a script across the tab 
file to clean it up:
	tr '\r\v' '\n ' | iconv -f MAC -t UTF-8
This basically converts the Mac \r's to \n's, replaces FileMaker's 
vtabs (for inter-field CRs) with blanks, and runs a character converter 
to build utf-8 data for Java to use.  Looks fine in jEdit and BBEdit, 
both of which understand UTF.

BUT -- when I look at the indexes created in Lucene using Luke, I get 
unprintable letters!  Writing programs to dump the terms (using Writer 
subclasses which handle unicode correctly) shows that indeed the files 
now have odd characters when viewed w/ jEdit and BBEdit.

The analyzer used to build the index looks like:

public class RedfishAnalyser extends Analyzer {
  String[] stopwords;

  public RedfishAnalyser(String[] stopwords) {
    this.stopwords = stopwords;
  }

  public RedfishAnalyser() {
    this.stopwords = StopAnalyzer.ENGLISH_STOP_WORDS;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Tokenize, run StandardFilter, lowercase, drop stopwords,
    // then Porter-stem whatever remains.
    return new PorterStemFilter(
        new StopFilter(
            new LowerCaseFilter(
                new StandardFilter(
                    new StandardTokenizer(reader))),
            stopwords));
  }
}
Yikes, what am I doing wrong?!  Is the analyzer at fault?  It's about 
the only place where I can see a problem happening.

Thanks for any pointers,
Owen


Re: Document Clustering

2005-02-07 Thread Owen Densmore
I would like to be able to analyze my document collection (~1200 
documents) and discover good "buckets" of categories for them.  I'm 
pretty sure this is termed Document Clustering .. finding the emergent 
clumps the documents fall naturally into judging from their term 
vectors.
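
(For what it's worth, once the term vectors are out of Lucene the 
pairwise step is small.  A sketch of the cosine measure over two sparse 
term-weight maps -- plain term frequencies here; TF*IDF weights would 
slot in the same way:)

import java.util.Iterator;
import java.util.Map;

public class Cosine {
  // Cosine similarity between two sparse term -> weight maps.
  static double cosine(Map a, Map b) {
    double dot = 0, na = 0, nb = 0;
    for (Iterator it = a.keySet().iterator(); it.hasNext();) {
      Object term = it.next();
      double wa = ((Number) a.get(term)).doubleValue();
      na += wa * wa;
      Number vb = (Number) b.get(term);
      if (vb != null) dot += wa * vb.doubleValue();
    }
    for (Iterator it = b.values().iterator(); it.hasNext();) {
      double wb = ((Number) it.next()).doubleValue();
      nb += wb * wb;
    }
    return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
  }
}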

Looking at the discussion that flared roughly a year ago (last message 
2003-11-12) with the subject Document Clustering, it seems Lucene 
should be able to help with this.  Has anyone had success with this 
recently?

Last year it was suggested Carrot2 could help, and it would even 
produce good labels for the clusters.  Has this proven to be true?  Our 
goal is to use clustering to build a nifty graphic interface, probably 
using Flash.

Thanks for any pointers.
Owen


Re: PHP-Lucene Integration

2005-02-07 Thread Owen Densmore
Wow, thanks all for the great spectrum of possibilities.  We'll be 
doing a design review in a week or two with the client, and we'll find 
out which way would be best for their site.  I'll report back then.

Thanks again, what a group!
Owen


PHP-Lucene Integration

2005-02-06 Thread Owen Densmore
I'm building a Lucene project for a client who uses PHP for their 
dynamic web pages.  It would be possible to add servlets to their 
environment easily enough (they use Apache) but I'd like to have 
minimal impact on their IT group.

There appears to be a PHP Java extension that lets PHP call back & 
forth to Java classes, but I thought I'd ask here if anyone has had 
success using Lucene from PHP.

Note: I looked in the Lucene in Action search page, and yup, I bought 
the book and love it!  No examples there tho.  The list archives 
mention that using Java Lucene from PHP is the way to go, without 
saying how.  There's mention of a Lucene server and a PHP interface to 
that.  And some similar comments.  But I'm a bit surprised there's not 
a bit more on using the official Java extension to PHP.

Thanks for the great package!
Owen


Right way to make analyzer

2005-02-03 Thread Owen Densmore
Is this the right way to make a Porter analyzer using the standard 
tokenizer?  I'm not sure about the order of the filters.

Owen
class MyAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new PorterStemFilter(
        new StopFilter(
            new LowerCaseFilter(
                new StandardFilter(
                    new StandardTokenizer(reader))),
            StopAnalyzer.ENGLISH_STOP_WORDS));
  }
}
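
(One quick sanity check on the order is to print what comes out.  A 
sketch, using the class above on a throwaway string:)

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class AnalyzerCheck {
  public static void main(String[] args) throws Exception {
    TokenStream ts = new MyAnalyzer().tokenStream(
        "Abstract", new StringReader("Generating clusters of documents"));
    // Expect lowercased, stopped, stemmed tokens: "generat", "cluster", ...
    for (Token t = ts.next(); t != null; t = ts.next()) {
      System.out.println(t.termText());
    }
  }
}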



Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-28 Thread Owen Densmore
I looked at the Carrot2 docs which mentioned dimension reduction via 
singular value decomposition (SVD) .. and other forms too I think.

Question: Does anyone have pointers to successful clustering techniques 
used with lucene?  I'm particularly interested in 2D and 3D graphics as 
well, possibly SOM (Self Organizing Maps).

I'm hoping to combine lucene with a graphical auto-clustering stunt of 
some kind but am not sure how to do it yet.

Owen

From: Akmal Sarhan <[EMAIL PROTECTED]>
Date: January 28, 2005 8:19:03 AM MST
To: Lucene Users List 
Subject: Re: carrot2 question too - Re: Fun with the Wikipedia
Hello,
we have been experimenting with carrot2 and are very pleased so far.
Only one issue: there is no release, not even an alpha one, and the
dependencies seem to be patched (Jama).  Are there any plans for a
release in the near future?
thanks
Akmal
On Monday, 17.01.2005, at 10:15 +0100, Dawid Weiss wrote:
Hi David,
I apologize about the delay in answering this one; Lucene is a busy
mailing list and I had a hectic last week...  Again, sorry for the
belated answer, hope you still find it useful.

> That is awesome and very inspirational!

Yes, I admit what you've done with Wikipedia is quite interesting and
looks very good.  I'm also glad you spent some time working out Carrot2
integration with Lucene.  It works quite nicely.

> > Carrot2 looks very interesting.  Wondering if anybody has a list of
> > all the
> Technically I don't think carrot2 uses lucene per se -- it's just that
> you can integrate the two, and ditto for Nutch -- it has code that
> uses Carrot2.

Yes, this is true.  Carrot2 doesn't use all of Lucene's potential -- it
merely takes the output from a query (titles, urls and snippets) and
attempts to cluster them into some sensible groups.  I think many things
could be improved, the most important of them being fast snippet
retrieval from Lucene, because right now it takes 50% of the time of the
clustering; I've seen a post a while ago describing a faster snippet
generation technique, and I'm sure that would give clustering a huge
boost speed-wise.

> And here's my question.  I reread the Carrot2<->Lucene code, esp.
> Demo.java, and there's this fragment:
>
>     // warm-up round (stemmer tables must be read etc).
>     List clusters = clusterer.clusterHits(docs);
>     long clusteringStartTime = System.currentTimeMillis();
>     clusters = clusterer.clusterHits(docs);
>     long clusteringEndTime = System.currentTimeMillis();
>
> Thus it calls clusterHits() twice.  I don't really understand how to
> use Carrot2 -- but I think the above is just for the sake of
> benchmarking clusterHits() w/o the effect of one-time initialization,
> and that there's no benefit to repeatedly calling clusterHits() (where
> a benefit might be that it can find nested clusters or whatever) -- is
> that right (that there's no benefit)?

No, there is absolutely no benefit from it.  It was merely to show
people that the clustering needs to be warmed up a bit.  I should not
have put it in the code, knowing people would be confused by it.  You
can safely call clusterHits() just once.  It will just have a small
delay at the first invocation.

Thanks for experimenting.  Please BCC me if you have any urgent
projects -- I read Lucene's list in batches, and my personal e-mail I
try to keep up to date with.

Dawid




Newbie: Human Readable Stemming, Lucene Architecture, etc!

2005-01-20 Thread Owen Densmore
Hi .. I'm new to the list so forgive a dumb question or two as I get 
started.

We're in the midst of converting a small collection (1200-1500 
currently) of scientific literature to be easily searchable/navigable.  
We'll likely provide both a text query interface and a graphical way to 
search and discover.

Our initial approach will be vector based, looking at Latent Semantic 
Indexing (LSI) as a potential tool, although if that's not needed, 
we'll stop at reasonably simple stemming with a weighted document term 
matrix (DTM).  (Bear in mind I couldn't even pronounce most of these 
concepts last week, so go easy if I'm incoherent!)

It looks to me like Lucene has a quite well-factored architecture.  I 
should at the very least be able to use the analyzer and stemmer to 
create a good starting point in the project.  I'd also like to leave a 
nice architecture behind in case we or others end up experimenting 
with, or extending, the system.

So a couple of questions:
1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) 
apparently produces non-word stems .. i.e. not really human readable.  
(Example: generate, generates, generated, generating  -> generat) 
Although in typical queries this is not important because the result of 
the search is a document list, it *would* be important if we use the 
stems within a graphical navigation interface.
So the question is: is there a way to have the stemmer produce English 
base forms of the words being stemmed?
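
(One workaround I've seen -- a bookkeeping trick, not a Lucene feature: 
while indexing, record every surface form that maps to each stem, and 
display the most frequent form.  A sketch:)

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class StemDisplay {
  private Map counts = new HashMap();  // stem -> (surface form -> Integer)

  // Call once per token while indexing, with the stem and original word.
  void record(String stem, String word) {
    Map forms = (Map) counts.get(stem);
    if (forms == null) counts.put(stem, forms = new HashMap());
    Integer n = (Integer) forms.get(word);
    forms.put(word, new Integer(n == null ? 1 : n.intValue() + 1));
  }

  // Most frequent surface form, e.g. "generat" -> "generated".
  String displayForm(String stem) {
    Map forms = (Map) counts.get(stem);
    if (forms == null) return stem;
    String best = stem;
    int bestN = 0;
    for (Iterator it = forms.keySet().iterator(); it.hasNext();) {
      String word = (String) it.next();
      int n = ((Integer) forms.get(word)).intValue();
      if (n > bestN) { bestN = n; best = word; }
    }
    return best;
  }
}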

2 - We're probably using Lucene in ways it was not designed for, such 
as DTM/LSI and graphical clustering and navigation.  Naturally we'll 
provide code for these parts that are not in Lucene.
But the question arises: is this kinda dumb?!  Has anyone stretched 
Lucene's design center with positive results?  Are we barking up the 
wrong tree?

3 - A nit on hyphenation: Our collection is scientific, so it has many 
hyphenated words.  I'm wondering about your experiences with 
hyphenation.  In our collection, things like self-organization, 
power-law, space-time, small-world, agent-based, etc. occur often, for 
example.
So the question is: do folks break up hyphenated words?  If not, do you 
stem the parts and glue them back together?  Do you apply stoplists to 
the parts?

Thanks for any help and pointers you can fling along,
Owen
http://backspaces.net/
http://redfish.com/