Re: Clustering from analyzed text instead of raw input

2010-03-05 Thread Stanislaw Osinski

 I'll give the stop word treatment a try, but the problem is that we perform
 POS tagging and then use payloads to keep only nouns and adjectives, and we
 thought it could be interesting to perform clustering only with these
 elements, to avoid meaningless words.


POS tagging could help a lot in clustering (not yet implemented in Carrot2
though), but ideally, we'd need to have POS tags attached to the original
tokenized text (so each token would be a tuple along the lines of: raw_text
+ stemmed + POS). If we have just nouns and adjectives, cluster labels will
be most likely harder to read (e.g. because of missing prepositions). I'm
not too familiar with Solr internals, but I'm assuming this type of
representation should be possible to implement using payloads? Then, we
could refactor Carrot2 a bit to work either on raw text or on the
tokenized/augmented representation.
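
To make the tuple idea above concrete, here is a minimal sketch in Python. It is not Carrot2 or Solr code: the `Token` class, the tag names `NOUN`/`ADJ`/`ADP`, and both helper functions are hypothetical and only illustrate how clustering could run on filtered stems while labels are still rebuilt from the raw text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    raw: str   # surface form as it appeared in the text
    stem: str  # normalized form used for matching
    pos: str   # part-of-speech tag (hypothetical tag set)


# Hypothetical tag names for the word classes kept for clustering
CONTENT_TAGS = {"NOUN", "ADJ"}


def clustering_features(tokens):
    """Stems of nouns and adjectives only: what a clusterer would compare."""
    return [t.stem for t in tokens if t.pos in CONTENT_TAGS]


def label_text(tokens, start, end):
    """Readable label rebuilt from the raw forms of tokens[start:end]."""
    return " ".join(t.raw for t in tokens[start:end])


tokens = [
    Token("clustering", "cluster", "NOUN"),
    Token("of", "of", "ADP"),
    Token("search", "search", "NOUN"),
    Token("results", "result", "NOUN"),
]

print(clustering_features(tokens))
print(label_text(tokens, 0, len(tokens)))
```

Because every token keeps its raw form, the preposition "of" is ignored when comparing documents but is still available when a label is printed.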

Cheers,

S.

Clustering from analyzed text instead of raw input

2010-03-03 Thread JCodina


I'm trying to use Carrot2 (I started with the Workbench) and I can
cluster any field, but the text used for clustering is the original raw
text, the one that was indexed, without any of the processing performed by
the tokenizer or filters.
So I get stop words.
I also built shingles (after filtering by POS), and I cannot cluster using
these multiword terms.
So my question is how to get the indexed (analyzed) text in a query response
instead of the original one, because if I set stored to false, the
search does not return the content of the field at all.

Thanks in advance

Joan
-- 
View this message in context: 
http://old.nabble.com/Clustering-from-anlayzed-text-instead-of-raw-input-tp27765780p27765780.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Clustering from analyzed text instead of raw input

2010-03-03 Thread Stanislaw Osinski

Hi Joan,

 I'm trying to use Carrot2 (I started with the Workbench) and I can
 cluster any field, but the text used for clustering is the original raw
 text, the one that was indexed, without any of the processing performed by
 the tokenizer or filters.
 So I get stop words.


The easiest way to fix this is to update the stop word list used by
Carrot2; see the "Tuning Carrot2 clustering" section at the bottom of
http://wiki.apache.org/solr/ClusteringComponent. If you want to get readable
cluster labels, it's best to feed the raw text for clustering: cluster
labels are phrases taken from the input text, so if you remove stop words and
stem everything, the phrases will become unreadable.
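
As a rough illustration of that last point, the sketch below runs one phrase through a toy analysis chain. The stop word set and the `toy_stem` suffix stripper are made up for this example and are much cruder than a real Solr analyzer or the Carrot2 stemmers; the point is only to show what a label candidate looks like after analysis.

```python
# Hypothetical stop word list, for illustration only
STOP_WORDS = {"of", "the", "in", "a", "an", "for"}


def toy_stem(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, drop stop words, stem: roughly what an index-time chain does."""
    return [toy_stem(w) for w in text.lower().split() if w not in STOP_WORDS]


raw = "Clustering of search results"
analyzed = " ".join(analyze(raw))

print(raw)       # what a label built from raw text can look like
print(analyzed)  # what is left for a label after analysis
```

The analyzed form is fine for comparing documents, but as a cluster label it reads much worse than the original phrase.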

Cheers,

Staszek

Re: Clustering from analyzed text instead of raw input

2010-03-03 Thread JCodina


Thanks, Staszek.
I'll give the stop word treatment a try, but the problem is that we perform
POS tagging and then use payloads to keep only nouns and adjectives, and we
thought it could be interesting to perform clustering only with these
elements, to avoid meaningless words.

Of course this is a clustering problem, but maybe it is also a feature that could
be interesting to have in Solr: storing not the raw input text but the
analyzed one, so stored could be False | Raw | Analyzed.

