Support for field-specific tokenizers, token- and character filters in search
results clustering
------------------------------------------------------------------------------------------------
Key: SOLR-2917
URL: https://issues.apache.org/jira/browse/SOLR-2917
Project: Solr
Issue Type: Improvement
Components: contrib - Clustering
Reporter: Stanislaw Osinski
Assignee: Stanislaw Osinski
Fix For: 3.6
Currently, the Carrot2 search results clustering component creates clusters based
on the raw text of a field. The reason for this is that Carrot2 aims to create
meaningful cluster labels by using sequences of words taken directly from the
documents' text (including stop words: _Development of Lucene and Solr_ is more
readable than _Development Lucene Solr_). The easiest way of providing input
for such a process was feeding Carrot2 with raw (stored) document content.
It is, however, possible to take into account +some+ of the field's filters
during clustering. Because Carrot2 does not currently expose an API for feeding
pre-tokenized input, the clustering component would need to:
1. get the raw text of the field,
2. run it through the field's char filters, tokenizer and selected token
filters (omitting e.g. the stop word filter and stemmers, because Carrot2 needs
the original words to produce readable cluster labels),
3. glue the output back into a string and feed it to Carrot2 for clustering.
In the future, to eliminate step 3, we could modify Carrot2 to accept
pre-tokenized content.
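
A rough sketch of steps 1-3, to make the proposal concrete. This is plain JDK code, not the Lucene/Solr analysis API and not the actual patch: the class name, prepare(), STRIP_TAGS, WHITESPACE_TOKENIZER and SPLIT_HYPHENS are all hypothetical stand-ins for a field's char filters, tokenizer and the token filters selected for clustering. Stop word removal and stemming are simply left out of the selected list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; none of these names exist in Solr or Carrot2.
final class ClusteringInputSketch {

    // Stand-in for a char filter: replaces markup tags with a space.
    static final UnaryOperator<String> STRIP_TAGS =
            s -> s.replaceAll("<[^>]*>", " ");

    // Stand-in for a tokenizer: splits on runs of whitespace.
    static final Function<String, List<String>> WHITESPACE_TOKENIZER = s -> {
        List<String> tokens = new ArrayList<>();
        for (String t : s.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    };

    // Stand-in for a token filter that keeps the words readable:
    // splits hyphenated tokens into their parts.
    static final UnaryOperator<List<String>> SPLIT_HYPHENS = tokens -> {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            for (String part : t.split("-")) {
                if (!part.isEmpty()) {
                    out.add(part);
                }
            }
        }
        return out;
    };

    // Step 2: char filters, tokenizer, selected token filters.
    // Step 3: glue the tokens back into one string for the clustering engine.
    static String prepare(String rawText,
                          List<UnaryOperator<String>> charFilters,
                          Function<String, List<String>> tokenizer,
                          List<UnaryOperator<List<String>>> selectedTokenFilters) {
        String text = rawText;
        for (UnaryOperator<String> filter : charFilters) {
            text = filter.apply(text);
        }
        List<String> tokens = tokenizer.apply(text);
        for (UnaryOperator<List<String>> filter : selectedTokenFilters) {
            tokens = filter.apply(tokens);
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Step 1: the raw (stored) field value.
        String raw = "<b>Development</b> of Lucene and Solr";
        String input = prepare(raw, List.of(STRIP_TAGS),
                WHITESPACE_TOKENIZER, List.of(SPLIT_HYPHENS));
        System.out.println(input); // Development of Lucene and Solr
    }
}
```

Note that the stop words ("of", "and") survive, since no stop word filter is in the selected list, so the text handed to Carrot2 still yields readable labels.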