Re: Document Clustering

2005-02-08 Thread Dawid Weiss
Hi Owen,

> Last year it was suggested Carrot2 could help, and it would even produce
> good labels for the clusters.  Has this proven to be true?

Yes, Carrot2 should help you with this. The labels it creates depend
heavily on the quality of the input snippets, but the so-called KWIC
snippets (keyword in context) should suffice (see David Spencer's
example with Wikipedia).

There is one thing, though: what is employed in Carrot2 is an on-line
unsupervised clusterer that is designed to work with a small number of
documents and incomplete descriptions (snippets versus full-text
documents). It will _not_ work for large document collections (thousands
of documents) simply because it was not designed to do that. I guess
you could try with up to 500 snippets -- beyond that, you'll be waiting
for the result forever.

There is a great number of algorithms that can cluster large document 
collections -- see proceedings from information retrieval conferences 
for example.

As for David's hints:
> I'm not sure what the complexity of the algorithm is, but for me ~100 
> docs works ok, maybe 200, but beyond 200 you need lots more CPU and RAM.

Yes, 100 to 200 snippets is optimal with the open source clustering 
algorithm. We have a refactored and optimized version of the Lingo 
clusterer that is commercial (it also provides hierarchical clustering 
capability as an add-on to the open source component). But even the 
commercial version will only cluster up to 500 -- 1000 snippets. As I 
said, it was not our goal to cluster document collections, rather to 
retrieve useful information from preprocessed snippets.

Dawid
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Document Clustering

2005-02-08 Thread David Spencer
Owen Densmore wrote:
> I would like to be able to analyze my document collection (~1200
> documents) and discover good "buckets" of categories for them.  I'm
> pretty sure this is termed Document Clustering .. finding the emergent
> clumps the documents fall naturally into judging from their term vectors.
>
> Looking at the discussion that flared roughly a year ago (last message
> 2003-11-12) with the subject Document Clustering, it seems Lucene should
> be able to help with this.  Has anyone had success with this recently?
>
> Last year it was suggested Carrot2 could help, and it would even produce
> good labels for the clusters.  Has this proven to be true?  Our goal is
> to use clustering to build a nifty graphic interface, probably using Flash.

Carrot2 seems to work nicely.

Demo here...

Search for something like "artificial intelligence" in my Wikipedia
search engine:

http://www.searchmorph.com/kat/wikipedia.jsp?s=artificial+intelligence

Then click on the "see clustered results.." link to go here:

http://www.searchmorph.com/kat/wikipedia-cluster.jsp?s=artificial%20intelligence

And voilà, what seem like decent clusters.

I'm not sure what the complexity of the algorithm is, but for me ~100
docs works OK, maybe 200, but beyond 200 you need lots more CPU and RAM.

I suggest: try it w/ ~100 docs, and if you like what you see, keep
increasing the # of docs you give it. You might have to wait a while w/
all 1,200 docs...

- Dave





Re: Document Clustering

2005-02-07 Thread Owen Densmore
I would like to be able to analyze my document collection (~1200 
documents) and discover good "buckets" of categories for them.  I'm 
pretty sure this is termed Document Clustering .. finding the emergent 
clumps the documents fall naturally into judging from their term 
vectors.

Looking at the discussion that flared roughly a year ago (last message 
2003-11-12) with the subject Document Clustering, it seems Lucene 
should be able to help with this.  Has anyone had success with this 
recently?

Last year it was suggested Carrot2 could help, and it would even 
produce good labels for the clusters.  Has this proven to be true?  Our 
goal is to use clustering to build a nifty graphic interface, probably 
using Flash.

Thanks for any pointers.
Owen


Re: Document Clustering

2003-11-12 Thread Eric Jain
> I was basically thinking of using lucene to generate document
> vectors, and writing my custom similarity algorithms for measuring
> distance.
>
> I could then run this data through k-means or SOM algorithms for
> calculating clusters

First of all, I think it would already be great if there was some
functionality for simply storing document vectors during the indexing
process, so you could later on use

  IndexSearcher.docTerms(int i)

to retrieve a BitSet or an array of floats that are weighted so that
frequent terms have lower values.

One difficulty I see here is that terms don't seem to have any unique
identifiers, guess you'd have to manage those yourself...
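
A minimal sketch of the idea above, written in Python rather than
against Lucene's Java API. It manages term identifiers by hand and
weights each term by inverse document frequency, so that terms found in
many documents get lower values. The name build_vectors and the toy
documents are assumptions made for illustration; nothing here is an
existing Lucene call.

```python
import math
from collections import Counter


def build_vectors(docs):
    """Return (term_ids, vectors) for a list of tokenized documents.

    term_ids maps each term to an integer id that we assign ourselves;
    each vector is a dict {term_id: weight}.
    """
    term_ids = {}
    doc_freq = Counter()
    for tokens in docs:
        for term in tokens:
            # Assign ids in order of first appearance.
            if term not in term_ids:
                term_ids[term] = len(term_ids)
        # Count each term once per document.
        doc_freq.update(set(tokens))

    n_docs = len(docs)
    vectors = []
    for tokens in docs:
        vec = {}
        for term, tf in Counter(tokens).items():
            # Terms that occur in many documents get a lower weight.
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            vec[term_ids[term]] = tf * idf
        vectors.append(vec)
    return term_ids, vectors


docs = [
    "lucene index search".split(),
    "lucene cluster label".split(),
    "lucene cluster snippet".split(),
]
term_ids, vectors = build_vectors(docs)
```

Here "lucene" appears in every document, so it ends up with a lower
weight than "index", which appears in only one.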

--
Eric Jain





Re: Document Clustering

2003-11-11 Thread marc
Thanks everyone for the responses and links to resources.

I was basically thinking of using Lucene to generate document vectors, and
writing my custom similarity algorithms for measuring distance.

I could then run this data through k-means or SOM algorithms for calculating
clusters.

Does this sound like I'm on the right track? I'm still just in the
*thinking* stage.
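
As a rough illustration of that plan, here is a small k-means sketch in
Python with a pluggable distance function. It assumes the document
vectors are already available as dense lists of floats; cosine distance
is only one possible choice for the custom similarity measure, and the
names kmeans and cosine_distance are made up for this example.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def kmeans(vectors, k, distance=cosine_distance, iterations=20, seed=0):
    """Return a list giving the cluster index assigned to each vector."""
    rng = random.Random(seed)
    # Start from k distinct documents picked at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assignment = kmeans(vectors, k=2)
```

Swapping in a different measure only requires passing another function
as the distance argument.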

Marc





Re: Document Clustering

2003-11-11 Thread Alex Aw Seat Kiong
Hi!

I'm also interested in it. Kindly CC me on the latest progress of your
clustering project.

Regards,
AlexAw


- Original Message - 
From: "Eric Jain" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 11, 2003 10:07 PM
Subject: Re: Document Clustering


> > I'm working on it. Classification and Clustering as well.
>
> Very interesting... if you get something working, please don't forget to
> notify this list :-)
>
> --
> Eric Jain



Re: Document Clustering

2003-11-11 Thread petite_abeille
On Nov 11, 2003, at 21:32, maurits van wijland wrote:

> There is the carrot project :
> http://www.cs.put.poznan.pl/dweiss/carrot/

"Leo Galambos, author of the Egothor project, constantly supports us 
with fresh ideas and includes Carrot components in his own project!"

http://www.cs.put.poznan.pl/dweiss/carrot/xml/authors.xml?lang=en

Small world :)

PA.



Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
really cool Stuff!!!


--
day time: www.media-style.com
spare time: www.text-mining.org | www.weta-group.net




Re: Document Clustering

2003-11-11 Thread maurits van wijland
Hi All and Marc,

There is the carrot project :
http://www.cs.put.poznan.pl/dweiss/carrot/

The carrot system consists of web services that can easily be fed by a Lucene
result list. You simply have to create a JSP that creates this XML file and
create a custom process and input component. The input component
for Lucene could look like:


http://www.dawidweiss.com/projects/carrot/componentDescriptor"; framework  =
"Carrot2">
http://localhost/weblucene/c2.jsp";
   infoURL  = "http://localhost/weblucene/";
/>


The c2.jsp file simply has to translate a result list into an XML file such
as:


 ...
 1.0
 http://...
 sum 1
 snip 2


 ...
 1.0
 http://...
 sum 2
 snip 2



Feed this into the carrot system, and you will get a nice clustered
result list. The amazing part of this clustering mechanism is that
the cluster labels are incredible; they're great!

Then there is an open source project called Classifier4J that can
be used for classification, the opposite of clustering. These other
open source projects are a great addition to the Lucene system.

I hope this helps...

Marc, what are you building?? Maybe we can help!

Kind regards,

Maurits


- Original Message - 
From: "marc" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, November 11, 2003 5:15 PM
Subject: Document Clustering


Hi,

does anyone have any sample code/documentation available for doing document
based clustering using lucene?

Thanks,
Marc






Re: Document Clustering

2003-11-11 Thread Stefan Groschupf


Marcel Stor wrote:

> Stefan Groschupf wrote:
> > Hi,
> > > How is document clustering different/related to text categorization?
> >
> > Clustering: try to find own categories and put documents that match
> > in it. You group all documents with minimal distance together.
>
> Would I be correct to say that you have to define a "distance threshold"
> parameter in order to define when to build a new category for a certain
> group?

I'm not sure. There are different data mining algorithms that could be
used; it depends on the algorithm. I prefer support vector machines (SVM).
There you calculate distances between multidimensional vectors in a
multidimensional space. One vector represents one document.

Stefan




Re: Document Clustering

2003-11-11 Thread Joshua O'Madadhain
On Tuesday, Nov 11, 2003, at 11:05 US/Pacific, Marcel Stor wrote:

> Stefan Groschupf wrote:
> > Hi,
> > > How is document clustering different/related to text categorization?
> >
> > Clustering: try to find own categories and put documents that match
> > in it. You group all documents with minimal distance together.
>
> Would I be correct to say that you have to define a "distance
> threshold" parameter in order to define when to build a new category
> for a certain group?

Depends on the type of clustering algorithm.  Some clustering
algorithms take the number of clusters as a parameter (in this case the
algorithm may be run several times with different values, to determine
the best value).  Other types of algorithms, such as hierarchical
agglomerative clustering algorithms, work more as you suggest.
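
To make the threshold variant concrete, here is a naive single-link
agglomerative sketch in Python. It keeps merging the two closest
clusters and stops once the closest pair is farther apart than the
threshold, so the number of clusters falls out of the data. The function
names, the Euclidean distance, and the sample points are all assumptions
for illustration, and the loop is cubic in the number of points, so it
only suits small inputs.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(points, threshold, distance=euclidean):
    """Return clusters as lists of point indices (single-link merging)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance between the closest members.
                d = min(
                    distance(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break  # nothing left that is close enough to merge
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
clusters = agglomerate(points, threshold=1.0)
```

With a threshold of 1.0 the two tight pairs merge and the far-away point
stays on its own, giving three clusters without asking for a count up
front.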

Regards,

Joshua O'Madadhain

[EMAIL PROTECTED]   Per Obscurius...   www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.



RE: Document Clustering

2003-11-11 Thread Marcel Stor
Stefan Groschupf wrote:
> Hi,
> > How is document clustering different/related to text categorization?
> 
> Clustering: try to find own categories and put documents that match
> in it. You group all documents with minimal distance together.

Would I be correct to say that you have to define a "distance threshold"
parameter in order to define when to build a new category for a certain
group?

> Classification: you have already categories and samples for
> it, that help you to match other documents.
> You calculate document distances to the existing categories
> and put it in the category with smallest distance.

Regards,
Marcel





Re: Document Clustering

2003-11-11 Thread Otis Gospodnetic
Thanks for the clarification, Stefan.  I should have known that... :)

Otis

--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi,
> >How is document clustering different/related to text categorization?
> 
> Clustering: try to find own categories and put documents that match
> in it. 
> You group all documents with minimal distance together. 
> 
> Classification: you have already categories and samples for it, that
> help you to match other documents. 
> You calculate document distances to the existing categories and put
> it in the category with smallest distance.
> 
> Cheers
> Stefan
> 
> -- 
> day time: www.media-style.com
> spare time: www.text-mining.org | www.weta-group.net
> 



Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
Hi,

> How is document clustering different/related to text categorization?

Clustering: you try to discover the categories themselves and put the
matching documents into them. You group all documents with minimal
distance together.

Classification: you already have categories and sample documents for
them, which help you match other documents. You calculate a document's
distance to each existing category and put it in the category with the
smallest distance.
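
The classification half of this can be sketched as a nearest-centroid
classifier in Python: each category is summarized by the mean of its
sample vectors, and a new document goes to the category whose centroid
is closest. The category names, vectors, and function names below are
invented for the example, and Euclidean distance is just one possible
measure.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train(samples):
    """samples maps a category name to its list of sample vectors."""
    return {category: centroid(vs) for category, vs in samples.items()}


def classify(vector, centroids):
    """Return the category whose centroid is nearest to the vector."""
    return min(centroids, key=lambda cat: euclidean(vector, centroids[cat]))


samples = {
    "sports": [[1.0, 0.0], [0.8, 0.2]],
    "politics": [[0.0, 1.0], [0.2, 0.8]],
}
centroids = train(samples)
label = classify([0.9, 0.1], centroids)
```

Unlike clustering, the categories here are fixed in advance by the
labeled samples; the code never creates a new one.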

Cheers
Stefan
--
day time: www.media-style.com
spare time: www.text-mining.org | www.weta-group.net




Re: Document Clustering

2003-11-11 Thread petite_abeille
On Nov 11, 2003, at 16:58, Tate Avery wrote:

> Categorization typically assigns documents to a node in a pre-defined
> taxonomy.
>
> For clustering, however, the categorization 'structure' is emergent...
> i.e. the clusters (which are analogous to taxonomy nodes) are created
> dynamically based on the content of the documents at hand.

Another way to look at it is this:

"An attempt to apply the Dewey Decimal system to an orgy." [1]

Without a Dewey Decimal system that is.

Cheers,

PA.

[1] http://www.eod.com/devil/archive/semantic_web.html



RE: Document Clustering

2003-11-11 Thread Tate Avery
Categorization typically assigns documents to a node in a pre-defined taxonomy.

For clustering, however, the categorization 'structure' is emergent... i.e. the 
clusters (which are analogous to taxonomy nodes) are created dynamically based on the 
content of the documents at hand.





Re: Document Clustering

2003-11-11 Thread petite_abeille
Hi Otis,

On Nov 11, 2003, at 16:41, Otis Gospodnetic wrote:

> How is document clustering different/related to text categorization?

Not that I'm an expert in any of this, but clustering is a much more
"holistic" approach than categorization. Usually, categorization is
understood as a more precise endeavor (e.g. dmoz.org), while clustering
is much more "fuzzy" and non-deterministic. Both try to achieve the
same goal though. So perhaps this is just a question of jargon.

I'm confident that the owner of this site could help shed some light
on the finer points of clustering vs categorization:

http://www.lissus.com/resources/index.htm

Cheers,

PA.



Re: Document Clustering

2003-11-11 Thread petite_abeille
On Nov 11, 2003, at 16:05, Marcel Stör wrote:

> As everybody seems to be so excited about it, would someone please be
> so kind to explain
> what "document based clustering" is?

This mostly means finding documents which are "similar" in some way(s).
The "similitude" is mostly in the eyes of the beholder. In such a
world, a "cluster" would be a pile of documents sharing something. As
far as Lucene goes, a straightforward way of approaching this could be
to use an entire document's content to query an index. Lucene's result
set could be construed as a "document cluster". Admittedly, this is
ground zero of "document clustering", but here you go anyway :)
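
A toy version of that query-by-document idea, in Python and without
Lucene: the whole text of one document is used as the query, every other
document is scored against it, and the hits above a cutoff form the
"cluster". Cosine similarity over raw term counts is a stand-in here and
is not Lucene's scoring formula; the corpus, the min_score cutoff, and
the function names are assumptions for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similar_documents(query_id, corpus, min_score=0.3):
    """Return ids of documents similar to corpus[query_id], best first."""
    query = Counter(corpus[query_id].lower().split())
    hits = []
    for doc_id, text in corpus.items():
        if doc_id == query_id:
            continue  # do not return the query document itself
        score = cosine(query, Counter(text.lower().split()))
        if score >= min_score:
            hits.append((score, doc_id))
    return [doc_id for score, doc_id in sorted(hits, reverse=True)]


corpus = {
    "a": "lucene indexes text for full text search",
    "b": "full text search with lucene is fast",
    "c": "the weather in poznan is sunny today",
}
cluster = similar_documents("a", corpus)
```

Document "b" shares several terms with "a" and is returned, while "c"
shares none and is left out.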

Here is an illustration:

"Patterns in Unstructured Data"
Discovery, Aggregation, and Visualization
http://javelina.cet.middlebury.edu/lsa/out/cover_page.htm
Cheers,

PA.



Re: Document Clustering

2003-11-11 Thread Otis Gospodnetic

--- Leo Galambos <[EMAIL PROTECTED]> wrote:
> Marcel Stör wrote:
> 
> >Hi
> >
> >As everybody seems to be so excited about it, would someone please be
> so kind to explain 
> >what "document based clustering" is?

AFAIK, "document clustering" consists of detection of documents with
similar content (similar subjects/topics).
 
> Hi
> 
> they are trying to implement what you can see in the right panel
> here:
> http://www.egothor.dundee.ac.uk/egothor/q2c.jsp?q=protein
> They may also analyze identical pages (hit #9 and #10) - this could
> be 
> also taken as "clustering" AFAIK.

Interesting.

> For instance, Doug wrote some papers about clustering (if I remember
> it 
> correctly) - see his bibliography.


How is document clustering different/related to text categorization?

Thanks,
Otis





Re: Document Clustering

2003-11-11 Thread Leo Galambos
Marcel Stör wrote:

> Hi
>
> As everybody seems to be so excited about it, would someone please be
> so kind to explain what "document based clustering" is?

Hi

they are trying to implement what you can see in the right panel here:
http://www.egothor.dundee.ac.uk/egothor/q2c.jsp?q=protein
They may also analyze identical pages (hit #9 and #10) - this could be 
also taken as "clustering" AFAIK.

For instance, Doug wrote some papers about clustering (if I remember it 
correctly) - see his bibliography.

Leo



Re: Document Clustering

2003-11-11 Thread Marcel Stör
Hi

As everybody seems to be so excited about it, would someone please be so
kind to explain what "document based clustering" is?

Regards,
Marcel




Re: Document Clustering

2003-11-11 Thread Eric Jain
> I'm working on it. Classification and Clustering as well.

Very interesting... if you get something working, please don't forget to
notify this list :-)

--
Eric Jain





Re: Document Clustering

2003-11-11 Thread Stefan Groschupf
Hi Marc,

I'm working on it. Classification and clustering as well.
I was planning to do it for nutch.org, but some people there broke up
some important basic work I had already done, so maybe I will not
contribute it there.
However, it will be open source and I can notify you when something
useful is ready.
But it will take some weeks. I am currently working on radically
minimizing feature selection.

Cheers
Stefan


marc wrote:

> Hi,
>
> does anyone have any sample code/documentation available for doing
> document based clustering using lucene?
>
> Thanks,
> Marc


--
day time: www.media-style.com
spare time: www.text-mining.org | www.weta-group.net

