Term Weights and Clustering
I'm building a TDM (Term Document Matrix) from my Lucene index. As part of this, it would be useful to have the document term weights (the TF*IDF weights) if they are already available. Naturally I can compute them, but I suspect they are lurking behind an API I've not discovered yet. Is there an API for getting them?

I'm doing this as a first step in discovering a good set of clustering labels. My data collection is 1200 research papers, all of which have good metadata: titles, authors, abstracts, keyphrases and so on.

One source for how to do this is the thesis of Stanislaw Osinski and others like it:
    http://www.dcs.shef.ac.uk/teaching/eproj/msc2004/abs/m3so.htm
And the Carrot2 project, which uses similar techniques:
    http://www.cs.put.poznan.pl/dweiss/carrot/

My problem is simple: I need a fairly clear discussion of exactly how to generate the labels, and how to assign documents to them. The thesis is quite good, but I'm not sure I can reduce it to practice in the 2-3 days I have to evaluate it! Lucene has made the TDM easy to calculate, but I basically don't know what to do next!

Can anyone comment on whether or not this will work, and if so, suggest a quick way to get a demo on the air? For example, I don't seem to be able to ask Carrot2 to do a Google "site" search. If I could, I could simply aim Carrot2 at my collection with a very general search and see what clusters it discovers. This may be a gross misuse of Carrot2's clustering anyway, so it could easily be a blind alley.

Or is there a different stunt with Lucene that might work? For example, use Lucene to cluster the docs using a batch search where the queries are Library of Congress descriptions! Batch searching is *really fast* in Lucene -- I've been able to search the data collection against each distinct keyphrase in seconds!

Owen

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
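As far as I know, Lucene of this era doesn't hand back precomputed TF*IDF weights; you walk the index (TermEnum/TermDocs) and apply the weighting yourself. A minimal sketch of just the arithmetic in plain Java, assuming the default Similarity formulas (tf = sqrt(freq), idf = log(numDocs/(docFreq+1)) + 1); the method name tfIdfWeight is my own, and feeding it from an IndexReader walk is left out:

```java
// Sketch: the TF*IDF weighting Lucene's default Similarity applies.
// tf = sqrt(termFreq), idf = log(numDocs / (docFreq + 1)) + 1.
// In a real TDM build this would be fed from IndexReader's term/doc
// walk; here it is only the arithmetic, so it stands alone.
public class TfIdfSketch {

    // termFreq: occurrences of the term in one document
    // docFreq:  number of documents containing the term
    // numDocs:  total documents in the index
    public static double tfIdfWeight(int termFreq, int docFreq, int numDocs) {
        double tf = Math.sqrt(termFreq);
        double idf = Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
        return tf * idf;
    }

    public static void main(String[] args) {
        // A rare term in a 1200-doc collection outweighs a common one.
        System.out.println(tfIdfWeight(4, 10, 1200));   // rare term
        System.out.println(tfIdfWeight(4, 1000, 1200)); // common term
    }
}
```

A cell of the TDM is then tfIdfWeight for that (term, document) pair.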
Mail Archive Broken?
I just beamed into the archive: http://mail-archives.apache.org/eyebrowse/SummarizeList?listId=30 .. and it only goes through Feb 1! What's up?

Owen
Re: Multiple Keywords/Keyphrases fields
From: Erik Hatcher <[EMAIL PROTECTED]>
Date: February 12, 2005 3:09:15 PM MST
To: "Lucene Users List"
Subject: Re: Multiple Keywords/Keyphrases fields

The real question to answer is what types of queries you're planning on making. Rather than look at it from indexing forward, consider it from searching backwards. How will users query using those keyword phrases?

Hi Erik. Good point. There are two uses we are making of the keyphrases:

- Graphical Navigation: A Flash graphical browser will allow users to fly around in a space of documents, choosing what to be viewing: Authors, Keyphrases and Textual terms. In any of these cases, the "closeness" of any of the fields will govern how close they will appear graphically. In the case of authors, we will weight collaboration .. how often the authors work together. In the case of Keyphrases, we will want to use something like the distance vectors you show in the book, using the cosine measure. Thus the keyphrases need to be separate entities within the document .. it would be a bug for us if the terms leaked across the separate keyphrases within the document.

- Textual Search: In this case, we will have two ways to search the keyphrases. The first would be like the graphical navigation above, where searching for "complex system" should require the terms to be in a single keyphrase. The second way will be looser, where we may simply pool the keyphrases with titles and abstracts, and allow them all to be searched together within the document.

Does this make sense? So the question from the search standpoint is: do multiple instances of a field act like there are barriers across the instances, or are they somehow treated as a single instance? In terms of the closeness calculation, for example, can we get separate term vectors for each instance of the keyphrase field, or will we get a single vector combining all the keyphrase terms within a single document? I hope this is clear! Kinda hard to articulate.
Owen

Erik

On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote:

I'm getting a bit more serious about the final form of our lucene index. Each document has DocNumber, Authors, Title, Abstract, and Keywords. By Keywords, I mean a comma separated list, each entry having possibly many terms in a phrase like:

    temporal infomax, finite state automata, Markov chains, conditional entropy, neural information processing

I presume I should be using a field "Keywords" which has many "entries" or "instances" per document (one per comma separated phrase). But I'm not sure of the right way to handle all this. My assumption is that I should analyze them individually, just as we do for free text (the Abstract, for example), thus in the example above having 5 entries of the nature

    doc.add(Field.Text("Keywords", "finite state automata"));

etc, analyzing them because these are author-supplied strings with no canonical form. For guidance, I looked in the archive and found the attached email, but I didn't see the answer. (I'm not concerned about the dups; I presume that is equivalent to a boost of some sort.) Does this seem right? Thanks once again.

Owen

From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Subject: Multiple equal Fields?
Date: Tue, 17 Feb 2004 12:47:58 +0100

Hi! What happens if I do this:

    doc.add(Field.Text("foo", "bar"));
    doc.add(Field.Text("foo", "blah"));

Is there a field "foo" with value "blah", or are there two "foo"s (actually not possible), or is there one "foo" with the values "bar" and "blah"? And what does happen in this case:

    doc.add(Field.Text("foo", "bar"));
    doc.add(Field.Text("foo", "bar"));
    doc.add(Field.Text("foo", "bar"));

Does lucene store this only once?

Timo
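The "closeness" calculation discussed above is just the cosine between weighted term vectors. A minimal stdlib sketch, independent of Lucene's term-vector API; representing each document as a double[] of weights over a shared vocabulary is an assumption for illustration:

```java
// Sketch: cosine "closeness" between two documents' term-weight vectors.
// Each double[] holds one weight per term of a shared vocabulary;
// building those vectors from the Lucene index is left out here.
public class Cosine {

    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        if (na == 0 || nb == 0) return 0; // empty vector: closeness is 0
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] doc1 = {1, 2, 0};
        double[] doc2 = {1, 2, 0};
        double[] doc3 = {0, 0, 3};
        System.out.println(cosine(doc1, doc2)); // same direction -> 1.0
        System.out.println(cosine(doc1, doc3)); // no shared terms -> 0.0
    }
}
```

Whether the vectors come per keyphrase instance or pooled per document is exactly the barrier question raised above; the cosine itself is the same either way.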
Multiple Keywords/Keyphrases fields
I'm getting a bit more serious about the final form of our lucene index. Each document has DocNumber, Authors, Title, Abstract, and Keywords. By Keywords, I mean a comma separated list, each entry having possibly many terms in a phrase like:

    temporal infomax, finite state automata, Markov chains, conditional entropy, neural information processing

I presume I should be using a field "Keywords" which has many "entries" or "instances" per document (one per comma separated phrase). But I'm not sure of the right way to handle all this. My assumption is that I should analyze them individually, just as we do for free text (the Abstract, for example), thus in the example above having 5 entries of the nature

    doc.add(Field.Text("Keywords", "finite state automata"));

etc, analyzing them because these are author-supplied strings with no canonical form. For guidance, I looked in the archive and found the attached email, but I didn't see the answer. (I'm not concerned about the dups; I presume that is equivalent to a boost of some sort.) Does this seem right? Thanks once again.

Owen

From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Subject: Multiple equal Fields?
Date: Tue, 17 Feb 2004 12:47:58 +0100

Hi! What happens if I do this:

    doc.add(Field.Text("foo", "bar"));
    doc.add(Field.Text("foo", "blah"));

Is there a field "foo" with value "blah", or are there two "foo"s (actually not possible), or is there one "foo" with the values "bar" and "blah"? And what does happen in this case:

    doc.add(Field.Text("foo", "bar"));
    doc.add(Field.Text("foo", "bar"));
    doc.add(Field.Text("foo", "bar"));

Does lucene store this only once?

Timo
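The pre-processing step described above — one field instance per comma-separated phrase — is plain string handling; only the final doc.add(Field.Text(...)) call is Lucene's API, so it appears here only as a comment:

```java
// Sketch: split an author-supplied keyword string into separate
// keyphrase values, one per comma-separated phrase, each of which
// would then become its own "Keywords" field instance.
public class KeyphraseSplit {

    public static String[] splitKeyphrases(String keywords) {
        // Split on commas, trimming surrounding whitespace.
        return keywords.trim().split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String raw = "temporal infomax, finite state automata, Markov chains, "
                   + "conditional entropy, neural information processing";
        for (String phrase : splitKeyphrases(raw)) {
            // In the indexing code this would become, per phrase:
            //   doc.add(Field.Text("Keywords", phrase));
            System.out.println(phrase);
        }
    }
}
```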
Re: Lucene Unicode Usage
Bingo! I used the InputStreamReader and that fixed the index. Boy, it's tough to catch all the holes through which unicode leaks!

Owen

From: aurora <[EMAIL PROTECTED]>
Date: February 9, 2005 11:04:35 PM MST
To: lucene-user@jakarta.apache.org
Subject: Re: Lucene Unicode Usage

So you got a utf8 encoded text file. But how do you read the file into Java? The default encoding of Java is likely to be something other than utf8. Make sure you specify the encoding like:

    InputStreamReader(new FileInputStream(filename), "UTF-8");

--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/

From: Andrzej Bialecki <[EMAIL PROTECTED]>
Date: February 10, 2005 2:54:56 AM MST
To: Lucene Users List
Subject: Re: Lucene Unicode Usage

Owen Densmore wrote:

    I'm building an index from a FileMaker database by dumping the data to a tab-separated file. Because the FileMaker output is encoded in MacRoman, and uses Mac line separators, I run a script across the tab file to clean it up:

        tr '\r\v' '\n ' | iconv -f MAC -t UTF-8

    This basically converts the Mac \r's to \n's, replaces FileMaker's vtabs (for inter-field CRs) with blanks, and runs a character converter to build utf-8 data for Java to use. Looks fine in jEdit and BBEdit, both of which understand UTF.

However, it matters how you read the files in your Java application. Did you use InputStreamReader with the default platform encoding (probably 8859-1), or did you specify UTF-8 explicitly?

    BUT -- when I look at the indexes created in Lucene using Luke, I get unprintable letters! Writing programs to dump the terms (using Writer

By default Luke uses the standard platform-specific font "dialog". On Windows this font doesn't support Unicode glyphs, so you will see just blanks (or rectangles). In the upcoming release you will be able to select the display font.

--
Best regards,
Andrzej Bialecki
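The fix aurora describes — naming the charset explicitly instead of trusting the platform default — in a self-contained sketch; the file and helper name are illustrative:

```java
import java.io.*;

// Sketch: read a UTF-8 text file explicitly, rather than with the
// platform default encoding (which on a 2005-era setup was often
// MacRoman or ISO-8859-1, silently mangling multi-byte characters).
public class Utf8Read {

    public static String readUtf8(File file) throws IOException {
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), "UTF-8"));
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) sb.append((char) c);
        in.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("utf8demo", ".txt");
        f.deleteOnExit();
        OutputStream out = new FileOutputStream(f);
        out.write("Osi\u0144ski".getBytes("UTF-8")); // "Osiński"
        out.close();
        System.out.println(readUtf8(f)); // round-trips intact
    }
}
```

The same rule applies on the way out: wrap FileOutputStream in an OutputStreamWriter with "UTF-8" when dumping terms, or the dump itself becomes another leak.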
Lucene Unicode Usage
I'm building an index from a FileMaker database by dumping the data to a tab-separated file. Because the FileMaker output is encoded in MacRoman, and uses Mac line separators, I run a script across the tab file to clean it up:

    tr '\r\v' '\n ' | iconv -f MAC -t UTF-8

This basically converts the Mac \r's to \n's, replaces FileMaker's vtabs (for inter-field CRs) with blanks, and runs a character converter to build utf-8 data for Java to use. Looks fine in jEdit and BBEdit, both of which understand UTF.

BUT -- when I look at the indexes created in Lucene using Luke, I get unprintable letters! Writing programs to dump the terms (using Writer subclasses which handle unicode correctly) shows that indeed the files now have odd characters when viewed w/ jEdit and BBEdit. The analyzer used to build the index looks like:

    public class RedfishAnalyser extends Analyzer {
        String[] stopwords;

        public RedfishAnalyser(String[] stopwords) {
            this.stopwords = stopwords;
        }

        public RedfishAnalyser() {
            this.stopwords = StopAnalyzer.ENGLISH_STOP_WORDS;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new PorterStemFilter(
                new StopFilter(
                    new LowerCaseFilter(
                        new StandardFilter(
                            new StandardTokenizer(reader))),
                    stopwords));
        }
    }

Yikes, what am I doing wrong?! Is the analyzer at fault? It's about the only place where I can see a problem happening. Thanks for any pointers,

Owen
Re: Document Clustering
I would like to be able to analyze my document collection (~1200 documents) and discover good "buckets" of categories for them. I'm pretty sure this is termed Document Clustering .. finding the emergent clumps the documents fall naturally into, judging from their term vectors.

Looking at the discussion that flared roughly a year ago (last message 2003-11-12) with the subject Document Clustering, it seems Lucene should be able to help with this. Has anyone had success with this recently? Last year it was suggested Carrot2 could help, and that it would even produce good labels for the clusters. Has this proven to be true?

Our goal is to use clustering to build a nifty graphic interface, probably using Flash. Thanks for any pointers.

Owen
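For a sense of what "finding the emergent clumps from term vectors" means mechanically, here is a toy single-pass assignment step of the kind k-means-style clustering iterates. This is not Carrot2's algorithm (Carrot2 clusters query snippets), just the nearest-centroid idea over term vectors; all the data is made up:

```java
// Toy sketch: assign each document's term vector to the nearest of k
// centroid vectors by cosine similarity -- the assignment step that
// k-means-style clustering repeats. Vectors and centroids are made up.
public class NearestCentroid {

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    // Returns, for each document, the index of its closest centroid.
    public static int[] assign(double[][] docs, double[][] centroids) {
        int[] cluster = new int[docs.length];
        for (int d = 0; d < docs.length; d++) {
            double best = -1;
            for (int c = 0; c < centroids.length; c++) {
                double sim = cosine(docs[d], centroids[c]);
                if (sim > best) { best = sim; cluster[d] = c; }
            }
        }
        return cluster;
    }

    public static void main(String[] args) {
        double[][] docs = {{3, 1, 0}, {0, 1, 4}, {2, 2, 0}};
        double[][] centroids = {{1, 0, 0}, {0, 0, 1}};
        int[] c = assign(docs, centroids);
        System.out.println(c[0] + " " + c[1] + " " + c[2]); // 0 1 0
    }
}
```

A full clustering loop would recompute each centroid as the mean of its assigned vectors and repeat until assignments stop changing; labeling the resulting buckets is the harder problem the thesis addresses.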
Re: PHP-Lucene Integration
Wow, thanks all for the great spectrum of possibilities. We'll be doing a design review in a week or two with the client and we'll find out what way would be best for their site. I'll report back then. Thanks again, what a group!

Owen
PHP-Lucene Integration
I'm building a lucene project for a client who uses php for their dynamic web pages. It would be possible to add servlets to their environment easily enough (they use apache), but I'd like to have minimal impact on their IT group. There appears to be a php java extension that lets php call back & forth to java classes, but I thought I'd ask here if anyone has had success using lucene from php.

Note: I looked in the Lucene in Action search page, and yup, I bought the book and love it! No examples there tho. The list archives mention that using java lucene from php is the way to go, without saying how. There's mention of a lucene server and a php interface to that, and some similar comments. But I'm a bit surprised there's not a bit more in terms of use of the official java extension to php.

Thanks for the great package!

Owen
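The "lucene server with a php interface" idea mentioned above can be as small as a line-oriented TCP server that PHP talks to over a socket. A sketch with the search itself stubbed out (a real version would run the query against a Lucene IndexSearcher and serialize the hits; the port and protocol here are made up):

```java
import java.io.*;
import java.net.*;

// Sketch of the "lucene server" idea: a tiny line-oriented TCP server
// a PHP page could query with fsockopen(). One query per connection;
// the search itself is a stub standing in for a Lucene IndexSearcher.
public class SearchServer {

    // Stub standing in for a Lucene search; returns one result line.
    public static String search(String query) {
        return "RESULTS\t" + query + "\t0 hits (stub)";
    }

    // Handle a single connection: read one query line, write one reply.
    public static void serveOnce(ServerSocket server) throws IOException {
        Socket s = server.accept();
        BufferedReader in = new BufferedReader(
            new InputStreamReader(s.getInputStream(), "UTF-8"));
        PrintWriter out = new PrintWriter(
            new OutputStreamWriter(s.getOutputStream(), "UTF-8"), true);
        out.println(search(in.readLine()));
        s.close();
    }

    public static void main(String[] args) throws IOException {
        ServerSocket server = new ServerSocket(9090); // port is arbitrary
        while (true) serveOnce(server);
    }
}
```

Whether this beats the php-java extension depends on the client's ops constraints; a socket server keeps the JVM out of the apache process entirely.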
Right way to make analyzer
Is this the right way to make a porter analyzer using the standard tokenizer? I'm not sure about the order of the filters.

Owen

    class MyAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new PorterStemFilter(
                new StopFilter(
                    new LowerCaseFilter(
                        new StandardFilter(
                            new StandardTokenizer(reader))),
                    StopAnalyzer.ENGLISH_STOP_WORDS));
        }
    }
Re: carrot2 question too - Re: Fun with the Wikipedia
I looked at the Carrot2 docs, which mentioned dimension reduction via singular value decomposition (SVD) .. and other forms too, I think.

Question: Does anyone have pointers to successful clustering techniques used with lucene? I'm particularly interested in 2D and 3D graphics as well, possibly SOM (Self Organizing Maps). I'm hoping to combine lucene with a graphical auto-clustering stunt of some kind but am not sure how to do it yet.

Owen

From: Akmal Sarhan <[EMAIL PROTECTED]>
Date: January 28, 2005 8:19:03 AM MST
To: Lucene Users List
Subject: Re: carrot2 question too - Re: Fun with the Wikipedia

Hello, we have been experimenting with carrot2 and are very pleased so far. Only one issue: there is no release, not even an alpha one, and the dependencies seem to be patched (jama). Are there any intentions to have releases in the near future?

thanks
Akmal

On Monday, 17.01.2005 at 10:15 +0100, Dawid Weiss wrote:

Hi David, I apologize about the delay in answering this one. Lucene is a busy mailing list and I had a hectic last week... Again, sorry for the belated answer; hope you still find it useful.

    That is awesome and very inspirational!

Yes, I admit what you've done with Wikipedia is quite interesting and looks very good. I'm also glad you spent some time working out Carrot integration with Lucene. It works quite nicely.

    Carrot2 looks very interesting. Wondering if anybody has a list of all the
    Technically I don't think carrot2 uses lucene per se - it's just that you can integrate the two, and ditto for Nutch - it has code that uses Carrot2.

Yes, this is true. Carrot2 doesn't use all of Lucene's potential -- it merely takes the output from a query (titles, urls and snippets) and attempts to cluster them into some sensible groups.
I think many things could be improved, the most important of them being fast snippet retrieval from Lucene, because right now it takes 50% of the clustering time. I've seen a post a while ago describing a faster snippet generation technique; I'm sure that would give clustering a huge boost speed-wise.

And here's my question. I reread the Carrot2<->Lucene code, esp Demo.java, and there's this fragment:

    // warm-up round (stemmer tables must be read etc).
    List clusters = clusterer.clusterHits(docs);

    long clusteringStartTime = System.currentTimeMillis();
    clusters = clusterer.clusterHits(docs);
    long clusteringEndTime = System.currentTimeMillis();

Thus it calls clusterHits() twice. I don't really understand how to use Carrot2, but I think the above is just for the sake of benchmarking clusterHits() without the effect of one-time initialization -- and that there's no benefit to repeatedly calling clusterHits (where a benefit might be that it can find nested clusters or whatever). Is that right (that there's no benefit)?

No, there is absolutely no benefit from it. It was merely to show people that the clustering needs to be warmed up a bit. I should not have put it in the code, knowing people would be confused by it. You can safely use clusterHits just once. It will just have a small delay at the first invocation.

Thanks for experimenting. Please BCC me if you have any urgent projects -- I read Lucene's list in batches, but my personal e-mail I try to keep up to date with.

Dawid
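The warm-up pattern Demo.java uses generalizes to any JVM benchmark: pay the one-time costs (class loading, stemmer tables, JIT) on a throwaway run, then time the second run. A stand-in sketch with a dummy workload instead of clusterHits():

```java
// Sketch of the warm-up-then-measure pattern from Demo.java. The
// work() method is a stand-in computation, not Carrot2's clusterHits().
public class WarmupTiming {

    static long work() {
        long sum = 0;
        for (int i = 0; i < 1000000; i++) sum += i;
        return sum;
    }

    public static long timedRunMillis() {
        work(); // warm-up round: class init, JIT compilation, caches
        long start = System.currentTimeMillis();
        work(); // the measured round
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        System.out.println("measured ms: " + timedRunMillis());
    }
}
```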
Newbie: Human Readable Stemming, Lucene Architecture, etc!
Hi .. I'm new to the list, so forgive a dumb question or two as I get started.

We're in the midst of converting a small collection (1200-1500 currently) of scientific literature to be easily searchable/navigable. We'll likely provide both a text query interface as well as a graphical way to search and discover. Our initial approach will be vector based, looking at Latent Semantic Indexing (LSI) as a potential tool, although if that's not needed, we'll stop at reasonably simple stemming with a weighted document term matrix (DTM). (Bear in mind I couldn't even pronounce most of these concepts last week, so go easy if I'm incoherent!)

It looks to me that Lucene has a quite well factored architecture. I should at the very least be able to use the analyzer and stemmer to create a good starting point in the project. I'd also like to leave a nice architecture behind in case we or others end up experimenting with, or extending, the system. So a couple of questions:

1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems .. i.e. not really human readable. (Example: generate, generates, generated, generating -> generat) Although in typical queries this is not important, because the result of the search is a document list, it *would* be important if we use the stems within a graphical navigation interface. So the question is: Is there a way to have the stemmer produce english base forms of the words being stemmed?

2 - We're probably using Lucene in ways it was not designed for, such as DTM/LSI and graphical clustering and navigation. Naturally we'll provide code for these parts that are not in Lucene. But the question arises: is this kinda dumb?! Has anyone stretched Lucene's design center with positive results? Are we barking up the wrong tree?

3 - A nit on hyphenation: Our collection is scientific so has many hyphenated words. I'm wondering about your experiences with hyphenation.
In our collection, things like self-organization, power-law, space-time, small-world, agent-based, etc. occur often, for example. So the question is: Do folks break up hyphenated words? If not, do you stem the parts and glue them back together? Do you apply stoplists to the parts?

Thanks for any help and pointers you can fling along,

Owen
http://backspaces.net/
http://redfish.com/
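One common workaround for question 1 is to keep the stem for matching but map it back to a representative surface form for display. A toy sketch, assuming we pick the shortest original word seen for each stem; the suffix-stripping stem() here is a crude stand-in for Porter/Snowball, just to keep the example self-contained:

```java
import java.util.*;

// Toy sketch: index under stems, but display a human-readable
// representative per stem (here: the shortest surface form seen).
// The stem() method is a crude suffix stripper standing in for
// Porter/Snowball stemming.
public class StemDisplay {

    static String stem(String w) {
        for (String suffix : new String[] {"ing", "ed", "es", "e", "s"})
            if (w.endsWith(suffix))
                return w.substring(0, w.length() - suffix.length());
        return w;
    }

    // Map each stem to the shortest original word that produced it.
    public static Map<String, String> displayForms(List<String> words) {
        Map<String, String> display = new HashMap<String, String>();
        for (String w : words) {
            String s = stem(w);
            String cur = display.get(s);
            if (cur == null || w.length() < cur.length()) display.put(s, w);
        }
        return display;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
            "generate", "generates", "generated", "generating");
        // All four stem to "generat"; "generate" is shown to users.
        System.out.println(displayForms(words));
    }
}
```

The graphical navigation layer then labels nodes with displayForms(...).get(stem) while the index keeps working on stems underneath.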