Re: Document Clustering
Hi Owen,

> Last year it was suggested Carrot2 could help, and it would even
> produce good labels for the clusters. Has this proven to be true?

Yes, Carrot2 should help you with this. The labels it creates depend heavily on the quality of the input snippets, but the so-called KWIC snippets (keyword in context) should suffice (see David Spencer's example with Wikipedia).

There is one thing, though: what is employed in Carrot2 is an on-line unsupervised clusterer that is designed to work with a small number of documents and incomplete descriptions (snippets versus full-text documents). It will _not_ work for large document collections (thousands of documents) simply because it was not designed to do that. I guess you could try with up to 500 snippets -- beyond that, you'll be waiting for the result forever. There is a great number of algorithms that can cluster large document collections -- see proceedings from information retrieval conferences, for example.

As for David's hints:

> I'm not sure what the complexity of the algorithm is, but for me ~100
> docs works ok, maybe 200, but beyond 200 you need lots more CPU and RAM.

Yes, 100 to 200 snippets is optimal with the open source clustering algorithm. We have a refactored and optimized version of the Lingo clusterer that is commercial (it also provides hierarchical clustering capability as an add-on to the open source component). But even the commercial version will only cluster up to 500 -- 1000 snippets. As I said, it was not our goal to cluster document collections, rather to retrieve useful information from preprocessed snippets.

Dawid

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Document Clustering
Owen Densmore wrote:

> I would like to be able to analyze my document collection (~1200 documents) and discover good "buckets" of categories for them. I'm pretty sure this is termed Document Clustering .. finding the emergent clumps the documents fall naturally into judging from their term vectors.
>
> Looking at the discussion that flared roughly a year ago (last message 2003-11-12) with the subject Document Clustering, it seems Lucene should be able to help with this. Has anyone had success with this recently?
>
> Last year it was suggested Carrot2 could help, and it would even produce good labels for the clusters. Has this proven to be true? Our goal is to use clustering to build a nifty graphic interface, probably using Flash.

Carrot2 seems to work nicely. Demo here...

Search for something like "artificial intelligence" in my Wikipedia Search engine:

http://www.searchmorph.com/kat/wikipedia.jsp?s=artificial+intelligence

Then click on the "see clustered results.." link to go here:

http://www.searchmorph.com/kat/wikipedia-cluster.jsp?s=artificial%20intelligence

And voilà, what seem like decent clusters.

I'm not sure what the complexity of the algorithm is, but for me ~100 docs works ok, maybe 200, but beyond 200 you need lots more CPU and RAM.

I suggest: try it w/ ~100 docs, and if you like what you see, keep increasing the # of docs you give it. You might have to wait a while w/ all 1,200 docs...

- Dave

> Thanks for any pointers.
>
> Owen
Re: Document Clustering
I would like to be able to analyze my document collection (~1200 documents) and discover good "buckets" of categories for them. I'm pretty sure this is termed Document Clustering .. finding the emergent clumps the documents fall naturally into judging from their term vectors.

Looking at the discussion that flared roughly a year ago (last message 2003-11-12) with the subject Document Clustering, it seems Lucene should be able to help with this. Has anyone had success with this recently?

Last year it was suggested Carrot2 could help, and it would even produce good labels for the clusters. Has this proven to be true? Our goal is to use clustering to build a nifty graphic interface, probably using Flash.

Thanks for any pointers.

Owen
Re: Document Clustering
> I was basically thinking of using Lucene to generate document
> vectors, and writing my custom similarity algorithms for measuring
> distance.
>
> I could then run this data through k-means or SOM algorithms for
> calculating clusters

First of all, I think it would already be great if there was some functionality for simply storing document vectors during the indexing process, so you could later on use something like IndexSearcher.docTerms(int i) to retrieve a BitSet or an array of floats that are weighted so that frequent terms have lower values.

One difficulty I see here is that terms don't seem to have any unique identifiers; I guess you'd have to manage those yourself...

--
Eric Jain
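A minimal sketch of the idea in Eric's message, in plain Python rather than Lucene: term ids are assigned and managed by hand, and term counts are weighted so that terms occurring in many documents get lower values. The tf * idf weighting is one common choice and an assumption here; `build_vectors` and `term_ids` are hypothetical names, not part of any Lucene API.

```python
import math
from collections import Counter


def build_vectors(docs):
    """Return (term_ids, vectors) for a list of tokenized documents.

    term_ids maps each term to a self-managed integer id. Each vector is a
    sparse dict {term_id: weight} with weight = tf * log(N / df), so terms
    that occur in many documents receive lower weights (a term present in
    every document gets weight 0).
    """
    term_ids = {}
    doc_freq = Counter()
    for tokens in docs:
        seen = set()
        for term in tokens:
            if term not in term_ids:
                term_ids[term] = len(term_ids)  # next free id
            if term not in seen:
                seen.add(term)
                doc_freq[term] += 1  # count each document once per term

    n_docs = len(docs)
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append({
            term_ids[t]: tf * math.log(n_docs / doc_freq[t])
            for t, tf in counts.items()
        })
    return term_ids, vectors


docs = [
    ["lucene", "search", "index"],
    ["lucene", "cluster"],
    ["lucene", "search"],
]
term_ids, vectors = build_vectors(docs)
```

Sparse dicts keep memory proportional to the number of distinct terms per document, which matters once the vocabulary grows beyond a toy example.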
Re: Document Clustering
Thanks everyone for the responses and links to resources..

I was basically thinking of using Lucene to generate document vectors, and writing my custom similarity algorithms for measuring distance.

I could then run this data through k-means or SOM algorithms for calculating clusters.

Does this sound like I'm on the right track... I'm still just in the *thinking* stage.

Marc

- Original Message -
From: "Alex Aw Seat Kiong" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 11, 2003 5:47 PM
Subject: Re: Document Clustering

> Hi!
>
> I'm also interested in it. Kindly CC me on the latest progress of your
> clustering project.
>
> Regards,
> AlexAw
>
> - Original Message -
> From: "Eric Jain" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Tuesday, November 11, 2003 10:07 PM
> Subject: Re: Document Clustering
>
> > > I'm working on it. Classification and Clustering as well.
> >
> > Very interesting... if you get something working, please don't forget to
> > notify this list :-)
> >
> > --
> > Eric Jain
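Marc's plan (document vectors, a custom distance measure, then k-means) could be sketched roughly as follows. This is plain Python and not tied to Lucene; the function names are hypothetical. Two simplifications are assumed: the initial centroids are just the first k vectors, and centroids are updated with a plain arithmetic mean even though the default distance is cosine-based.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def kmeans(vectors, k, distance=cosine_distance, max_iter=100):
    """Cluster vectors into k groups; returns one cluster index per vector.

    Initial centroids are simply the first k vectors (an arbitrary but
    reproducible choice; random or k-means++ seeding is more usual).
    """
    centroids = [list(v) for v in vectors[:k]]
    assignment = None
    for _ in range(max_iter):
        # assignment step: each vector goes to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels = kmeans(vectors, k=2)
```

Because `distance` is a parameter, a custom similarity measure can be swapped in without touching the clustering loop.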
Re: Document Clustering
Hi!

I'm also interested in it. Kindly CC me on the latest progress of your clustering project.

Regards,
AlexAw

- Original Message -
From: "Eric Jain" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 11, 2003 10:07 PM
Subject: Re: Document Clustering

> > I'm working on it. Classification and Clustering as well.
>
> Very interesting... if you get something working, please don't forget to
> notify this list :-)
>
> --
> Eric Jain
Re: Document Clustering
On Nov 11, 2003, at 21:32, maurits van wijland wrote:

> There is the carrot project :
> http://www.cs.put.poznan.pl/dweiss/carrot/

"Leo Galambos, author of the Egothor project, constantly supports us with fresh ideas and includes Carrot components in his own project!"

http://www.cs.put.poznan.pl/dweiss/carrot/xml/authors.xml?lang=en

Small world :)

PA.
Re: Document Clustering
really cool Stuff!!!

maurits van wijland wrote:

> Hi All and Marc,
>
> There is the carrot project :
> http://www.cs.put.poznan.pl/dweiss/carrot/
>
> The carrot system consists of web services that can easily be fed by a Lucene result list. You simply have to create a JSP that creates this XML file and create a custom process and input component.
>
> The input component for Lucene could look like:
>
>     http://www.dawidweiss.com/projects/carrot/componentDescriptor" framework = "Carrot2">
>     http://localhost/weblucene/c2.jsp" infoURL = "http://localhost/weblucene/" />
>
> The c2.jsp file simply has to translate a result list into an XML file such as:
>
>     ... 1.0 http://... sum 1 snip 2
>     ... 1.0 http://... sum 2 snip 2
>
> Feed this into the carrot system, and you will get a nice clustered result list. The amazing part of this clustering mechanism is that the cluster labels are incredible; they're great!
>
> Then there is an open source project called Classifier4J that can be used for classification, the opposite of clustering.
>
> These other open source projects are a great addition to the Lucene system.
>
> I hope this helps... Marc, what are you building?? Maybe we can help!
>
> Kind regards,
>
> Maurits
>
> - Original Message -
> From: "marc" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Tuesday, November 11, 2003 5:15 PM
> Subject: Document Clustering
>
> > Hi,
> >
> > does anyone have any sample code/documentation available for doing document based clustering using lucene?
> >
> > Thanks,
> > Marc

--
day time: www.media-style.com
spare time: www.text-mining.org | www.weta-group.net
Re: Document Clustering
Hi All and Marc,

There is the carrot project :
http://www.cs.put.poznan.pl/dweiss/carrot/

The carrot system consists of web services that can easily be fed by a Lucene result list. You simply have to create a JSP that creates this XML file and create a custom process and input component.

The input component for Lucene could look like:

    http://www.dawidweiss.com/projects/carrot/componentDescriptor" framework = "Carrot2">
    http://localhost/weblucene/c2.jsp" infoURL = "http://localhost/weblucene/" />

The c2.jsp file simply has to translate a result list into an XML file such as:

    ... 1.0 http://... sum 1 snip 2
    ... 1.0 http://... sum 2 snip 2

Feed this into the carrot system, and you will get a nice clustered result list. The amazing part of this clustering mechanism is that the cluster labels are incredible; they're great!

Then there is an open source project called Classifier4J that can be used for classification, the opposite of clustering.

These other open source projects are a great addition to the Lucene system.

I hope this helps... Marc, what are you building?? Maybe we can help!

Kind regards,

Maurits

- Original Message -
From: "marc" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 11, 2003 5:15 PM
Subject: Document Clustering

> Hi,
>
> does anyone have any sample code/documentation available for doing document based clustering using lucene?
>
> Thanks,
> Marc
Re: Document Clustering
Marcel Stör wrote:

> Stefan Groschupf wrote:
>
> > Hi,
> >
> > > How is document clustering different/related to text categorization?
> >
> > Clustering: you try to find the categories themselves and put the documents that match into them. You group all documents with minimal distance together.
>
> Would I be correct to say that you have to define a "distance threshold" parameter in order to define when to build a new category for a certain group?

I'm not sure. There are different data mining algorithms that could be used; it depends on the algorithm. I prefer support vector machines (SVM). There you calculate distances between multidimensional vectors in a multidimensional space. One vector represents one document.

Stefan
Re: Document Clustering
On Tuesday, Nov 11, 2003, at 11:05 US/Pacific, Marcel Stör wrote:

> Stefan Groschupf wrote:
>
> > Hi,
> >
> > > How is document clustering different/related to text categorization?
> >
> > Clustering: you try to find the categories themselves and put the documents that match into them. You group all documents with minimal distance together.
>
> Would I be correct to say that you have to define a "distance threshold" parameter in order to define when to build a new category for a certain group?

Depends on the type of clustering algorithm. Some clustering algorithms take the number of clusters as a parameter (in this case the algorithm may be run several times with different values, to determine the best value). Other types of algorithms, such as hierarchical agglomerative clustering algorithms, work more as you suggest.

Regards,

Joshua O'Madadhain

[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
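To illustrate the threshold-driven case Joshua mentions, here is a minimal sketch of hierarchical agglomerative clustering in plain Python. It assumes single linkage and Euclidean distance, which are only one of several possible choices, and the function names are hypothetical. The number of clusters is not given up front; it falls out of the `threshold` parameter.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(vectors, threshold, distance=euclidean):
    """Single-linkage agglomerative clustering.

    Starts with one cluster per vector and repeatedly merges the two
    closest clusters until the closest pair is farther apart than
    `threshold`. Returns a list of clusters, each a list of vector indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # single linkage: distance between the closest pair of members
        return min(distance(vectors[i], vectors[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break  # no remaining pair is close enough to merge
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


points = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
tight = agglomerative(points, threshold=2)
loose = agglomerative(points, threshold=100)
```

This naive version recomputes every pairwise linkage on each pass, so it is only practical for small collections.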
RE: Document Clustering
Stefan Groschupf wrote:

> Hi,
>
> > How is document clustering different/related to text categorization?
>
> Clustering: you try to find the categories themselves and put the documents that match into them. You group all documents with minimal distance together.

Would I be correct to say that you have to define a "distance threshold" parameter in order to define when to build a new category for a certain group?

> Classification: you already have categories and samples for them, which help you to match other documents.
> You calculate the distance from a document to each existing category and put it in the category with the smallest distance.

Regards,

Marcel
Re: Document Clustering
Thanks for the clarification, Stefan. I should have known that... :)

Otis

--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:

> Hi,
>
> > How is document clustering different/related to text categorization?
>
> Clustering: you try to find the categories themselves and put the documents that match into them.
> You group all documents with minimal distance together.
>
> Classification: you already have categories and samples for them, which help you to match other documents.
> You calculate the distance from a document to each existing category and put it in the category with the smallest distance.
>
> Cheers
> Stefan
>
> --
> day time: www.media-style.com
> spare time: www.text-mining.org | www.weta-group.net
Re: Document Clustering
Hi,

> How is document clustering different/related to text categorization?

Clustering: you try to find the categories themselves and put the documents that match into them. You group all documents with minimal distance together.

Classification: you already have categories and samples for them, which help you to match other documents. You calculate the distance from a document to each existing category and put it in the category with the smallest distance.

Cheers
Stefan

--
day time: www.media-style.com
spare time: www.text-mining.org | www.weta-group.net
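The classification half of Stefan's description can be sketched as a nearest-centroid classifier. Representing each category by the mean of its sample vectors is an assumption made here for illustration (the message does not say how the distance to a category is measured), and the function names and toy data are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(doc, samples_by_category, distance=euclidean):
    """Return the category whose sample centroid is closest to `doc`."""
    centroids = {cat: centroid(vs) for cat, vs in samples_by_category.items()}
    return min(centroids, key=lambda cat: distance(doc, centroids[cat]))


samples = {
    "sports": [[1.0, 0.0], [0.8, 0.2]],
    "politics": [[0.0, 1.0], [0.1, 0.9]],
}
first = classify([0.9, 0.1], samples)
second = classify([0.2, 0.7], samples)
```

In contrast to clustering, the categories and their samples exist before any new document is seen; only the assignment step is computed.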
Re: Document Clustering
On Nov 11, 2003, at 16:58, Tate Avery wrote:

> Categorization typically assigns documents to a node in a pre-defined taxonomy.
>
> For clustering, however, the categorization 'structure' is emergent... i.e. the clusters (which are analogous to taxonomy nodes) are created dynamically based on the content of the documents at hand.

Another way to look at it is this:

"An attempt to apply the Dewey Decimal system to an orgy." [1]

Without a Dewey Decimal system that is.

Cheers,

PA.

[1] http://www.eod.com/devil/archive/semantic_web.html
RE: Document Clustering
Categorization typically assigns documents to a node in a pre-defined taxonomy.

For clustering, however, the categorization 'structure' is emergent... i.e. the clusters (which are analogous to taxonomy nodes) are created dynamically based on the content of the documents at hand.

-Original Message-
From: petite_abeille [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 11, 2003 10:50 AM
To: Lucene Users List
Subject: Re: Document Clustering

Hi Otis,

On Nov 11, 2003, at 16:41, Otis Gospodnetic wrote:

> How is document clustering different/related to text categorization?

Not that I'm an expert in any of this, but clustering is a much more "holistic" approach than categorization.

Usually, categorization is understood as a more precise endeavor (e.g. dmoz.org), while clustering is much more "fuzzy" and non-deterministic. Both try to achieve the same goal though. So perhaps this is just a question of jargon.

I'm confident that the owner of this site could help shed some light on the finer points of clustering vs categorization:

http://www.lissus.com/resources/index.htm

Cheers,

PA.
Re: Document Clustering
Hi Otis,

On Nov 11, 2003, at 16:41, Otis Gospodnetic wrote:

> How is document clustering different/related to text categorization?

Not that I'm an expert in any of this, but clustering is a much more "holistic" approach than categorization.

Usually, categorization is understood as a more precise endeavor (e.g. dmoz.org), while clustering is much more "fuzzy" and non-deterministic. Both try to achieve the same goal though. So perhaps this is just a question of jargon.

I'm confident that the owner of this site could help shed some light on the finer points of clustering vs categorization:

http://www.lissus.com/resources/index.htm

Cheers,

PA.
Re: Document Clustering
On Nov 11, 2003, at 16:05, Marcel Stör wrote:

> As everybody seems to be so excited about it, would someone please be so kind to explain what "document based clustering" is?

This mostly means finding documents that are "similar" in some way(s). The "similarity" is mostly in the eyes of the beholder. In such a world, a "cluster" would be a pile of documents sharing something.

As far as Lucene goes, a straightforward way of approaching this could be to use a document's entire content to query an index. Lucene's result set could be construed as a "document cluster".

Admittedly, this is ground zero of "document clustering", but here you go anyway :)

Here is an illustration:

"Patterns in Unstructured Data"
Discovery, Aggregation, and Visualization
http://javelina.cet.middlebury.edu/lsa/out/cover_page.htm

Cheers,

PA.
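The "ground zero" approach described in this message, using one document's whole text as the query and treating the hits as its cluster, can be imitated without an index. The sketch below is a plain-Python stand-in for a Lucene query, not Lucene code: it scores by cosine similarity over raw term counts, and the `min_score` cutoff of 0.3 is an arbitrary, hypothetical value.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def similar_documents(query_text, corpus, min_score=0.3):
    """Rank corpus documents against the full text of one document.

    `corpus` maps a document id to its text. Returns (doc_id, score) pairs
    with score >= min_score, best first; that result list plays the role
    of the "cluster" around the query document.
    """
    query = Counter(query_text.lower().split())
    scored = [
        (doc_id, cosine(query, Counter(text.lower().split())))
        for doc_id, text in corpus.items()
    ]
    hits = [(d, s) for d, s in scored if s >= min_score]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


corpus = {
    "a": "lucene index search engine",
    "b": "lucene search clustering",
    "c": "cooking pasta recipes",
}
hits = similar_documents("lucene search index", corpus)
```

Each query document yields its own result list, so the resulting "clusters" can overlap, which is one reason this counts only as a starting point.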
Re: Document Clustering
--- Leo Galambos <[EMAIL PROTECTED]> wrote:

> Marcel Stör wrote:
>
> > Hi
> >
> > As everybody seems to be so excited about it, would someone please be
> > so kind to explain what "document based clustering" is?

AFAIK, "document clustering" consists of detection of documents with similar content (similar subjects/topics).

> Hi
>
> they are trying to implement what you can see in the right panel here:
> http://www.egothor.dundee.ac.uk/egothor/q2c.jsp?q=protein
> They may also analyze identical pages (hit #9 and #10) - this could be
> also taken as "clustering" AFAIK.

Interesting.

> For instance, Doug wrote some papers about clustering (if I remember it
> correctly) - see his bibliography.

How is document clustering different/related to text categorization?

Thanks,
Otis
Re: Document Clustering
Marcel Stör wrote:

> Hi
>
> As everybody seems to be so excited about it, would someone please be so kind to explain what "document based clustering" is?

Hi

they are trying to implement what you can see in the right panel here:
http://www.egothor.dundee.ac.uk/egothor/q2c.jsp?q=protein

They may also analyze identical pages (hit #9 and #10) - this could be also taken as "clustering" AFAIK.

For instance, Doug wrote some papers about clustering (if I remember it correctly) - see his bibliography.

Leo
Re: Document Clustering
Hi

As everybody seems to be so excited about it, would someone please be so kind to explain what "document based clustering" is?

Regards,
Marcel
Re: Document Clustering
> I'm working on it. Classification and Clustering as well.

Very interesting... if you get something working, please don't forget to notify this list :-)

--
Eric Jain
Re: Document Clustering
Hi Marc,

I'm working on it. Classification and Clustering as well.

I was planning to do it for nutch.org, but some guys there broke some important basic work I had already done, so maybe I will not contribute it there.

However, it will be open source and I can notify you if something useful is ready. But it will take some weeks.

I am currently working on radically minimizing feature selection.

Cheers
Stefan

marc wrote:

> Hi,
>
> does anyone have any sample code/documentation available for doing document based clustering using lucene?
>
> Thanks,
> Marc

--
day time: www.media-style.com
spare time: www.text-mining.org | www.weta-group.net