Re: Assign rich-text document's title name from clustering results

2015-06-10 Thread Upayavira
It depends a lot on what the documents are. Some document formats have metadata that stores a title. Perhaps you can just extract that. If not, once you've extracted the content, perhaps you could just have a special field that is the first n words (followed by an ellipsis). If you use a
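
A minimal sketch of the fallback described above, assuming the text has already been extracted from the document. The function names `first_words_title` and `choose_title`, and the default of 8 words, are hypothetical and chosen only for illustration; nothing here is a Solr API.

```python
def first_words_title(text, n=8):
    """Return the first n whitespace-separated words of text.

    An ellipsis is appended only when the text was actually cut short.
    """
    words = text.split()
    if len(words) <= n:
        return " ".join(words)
    return " ".join(words[:n]) + "..."


def choose_title(metadata_title, content, n=8):
    """Prefer a title stored in the document metadata, if one exists.

    Falls back to the first n words of the extracted content otherwise.
    """
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()
    return first_words_title(content, n)
```

The result could then be sent to Solr as an ordinary field value at index time, which keeps the title logic outside of Solr itself.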

Re: Assign rich-text document's title name from clustering results

2015-06-10 Thread Zheng Lin Edwin Yeo
The main objective here is to assign a title to the documents as they are being indexed. We found that the cluster labels provide good information on the key points of the documents, but I'm not sure if we can get good cluster labels from a single document. Besides

Re: Assign rich-text document's title name from clustering results

2015-06-10 Thread Alessandro Benedetti
Hi Edwin, let's do this step by step. Clustering is a problem solved by unsupervised machine learning algorithms. The goal of clustering is to group a corpus of documents by similarity, trying to produce groups that are meaningful to a human being. Solr currently provides different approaches for *Query

Re: Assign rich-text document's title name from clustering results

2015-06-10 Thread Alessandro Benedetti
I agree with Upayavira: title extraction is an activity independent of Solr. Furthermore, I would say it's easy to extract the title before the Solr indexing stage. By the time the content arrives at the Solr update processors, it is already a String. If you want to do some clever title extraction,

Assign rich-text document's title name from clustering results

2015-06-09 Thread Zheng Lin Edwin Yeo
Hi, I'm currently using Solr 5.1, and I'm thinking of ways to have the system automatically give a title to the rich-text documents that are being indexed, instead of having the user enter it manually, as we might have to index a whole folder of documents together, so it is not wise for