Hi Salman Mahmood,

Why don't you try clustering first? Once you have applied high-level clustering, check the top terms. Set aside the clusters that look good and concentrate on the ones that look confused, and repeat until all the clusters seem fine. To get cleaner clusters, I indexed all the documents into Solr, because with Solr I could remove stop words, apply the Snowball filter, and so on. Once you have identified all the clusters, verify that each cluster's top terms are good. Then use the cluster points to split the documents according to their cluster. You will now have the documents in groups, and you can apply classification and train on that set.
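A rough sketch of the "check top terms, then split by cluster" step, in plain Python rather than Mahout or Solr. Everything named here is made up for illustration: the documents, the cluster assignments, the stop-word list, and the helper functions. In practice the assignments would come from the clustering output, and stop-word removal and stemming from the Solr analyzer (this sketch does no stemming).

```python
from collections import Counter, defaultdict

# Hypothetical stop-word list; a Solr analyzer would normally do this (plus stemming).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "and", "for", "as", "new"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, drop stop words."""
    words = (w.strip(".,:;!?'\"()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def group_by_cluster(docs, assignments):
    """Split documents into groups keyed by their assigned cluster id."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[assignments[doc_id]].append(text)
    return dict(groups)


def top_terms(texts, n=3):
    """Return the n most frequent non-stop-word terms in a group of documents."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [term for term, _ in counts.most_common(n)]


# Made-up documents and cluster assignments, standing in for real clustering output.
docs = {
    1: "Apple launches new iPhone",
    2: "iPhone sales lift Apple profit",
    3: "Intel shares fall as chip demand slows",
    4: "Intel unveils new chip",
}
assignments = {1: 0, 2: 0, 3: 1, 4: 1}

groups = group_by_cluster(docs, assignments)
for cluster_id, texts in sorted(groups.items()):
    print(cluster_id, top_terms(texts, 2), len(texts))

# Once the top terms look right, each group becomes labelled training data.
training_set = [(text, cluster_id) for cluster_id, texts in groups.items() for text in texts]
```

Inspecting the printed top terms per cluster is the manual check; only the clusters whose terms look right would go into `training_set`.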
I hope that makes sense. This is the approach I followed. Let me know if you have other ideas.

Syed Abdul Kather
Sent from Samsung S3

On Aug 1, 2012 10:38 PM, "Salman Mahmood" <salman...@gmail.com> wrote:
> Hi all,
>
> I am stuck between a decision to apply classification or clustering on the data set I have. The more I think about it, the more confused I get. Here's what I am confronted with.
>
> I have news documents (around 3000 and continuously increasing) containing news about companies, investment, stocks, the economy, quarterly income, etc. My goal is to have the news sorted in such a way that I know which news corresponds to which company. For example, for the news item "Apple launches new iPhone", I need to associate the company Apple with it. A particular news item/document contains only a 'title' and a 'description', so I have to analyze the text in order to find out which company the news refers to. It could be multiple companies too.
>
> To solve this, I turned to Mahout.
>
> I started with clustering. I was hoping to get 'Apple', 'Google', 'Intel', etc. as top terms in my clusters, and from there I would know that the news in a cluster corresponds to its cluster label, but things were a bit different. I got 'investment', 'stocks', 'correspondence', 'green energy', 'terminal', 'shares', 'street', 'olympics' and lots of other terms as the top ones (which makes sense, as clustering algorithms look for common terms). Although there were some 'Apple' clusters, the news items associated with them were very few. I thought maybe clustering is not for this kind of problem, as much of the company news goes into more general clusters (investment, profit) instead of the specific company cluster (Apple).
>
> I started reading about classification, which requires training data. The name was convincing too, as I actually want to 'classify' my news items into 'company names'.
> As I read on, I got the impression that the name 'classification' is a bit deceiving and that the technique is used more for prediction than for classification. My other confusion is how to prepare training data for news documents. Let's assume I have a list of companies that I am interested in, and I write a program to produce training data for the classifier. The program checks whether the news title or description contains the company name 'Apple'; if so, it's a news story about Apple. Is this how I can prepare training data? (Of course, I read that training data is actually a set of predictors and target variables.) If so, then why should I use Mahout classification in the first place? I should ditch Mahout and instead use this little program that I wrote for training data (which actually does the classification).
>
> You can see how confused I am about how to address this issue. Another thing that concerns me is whether it's possible to make a system so intelligent that, if the news says 'iPhone sales at a record high' without using the word 'Apple', the system can classify it as news related to Apple.
>
> Thank you in advance for pointing me in the right direction.
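
On the two questions in the quoted message (is name matching a valid way to build training data, and can a system tag 'iPhone sales at a record high' without the word 'Apple'), here is a minimal sketch under stated assumptions. It labels stories by literal name match, then trains a small hand-written naive Bayes on the remaining words, so the model has to rely on cues other than the company name. It is plain Python, not Mahout's classifier; the company list, the news items, and all function names are hypothetical, and each story gets only one label even though a real story may mention several companies.

```python
import math
from collections import Counter, defaultdict

COMPANIES = ["apple", "intel"]  # hypothetical list of companies of interest


def tokenize(text):
    """Lowercase and strip surrounding punctuation from each word."""
    words = (w.strip(".,:;!?'\"()").lower() for w in text.split())
    return [w for w in words if w]


def weak_label(text):
    """Label a story by literal company-name match; None if no name appears."""
    tokens = set(tokenize(text))
    for company in COMPANIES:
        if company in tokens:
            return company  # simplification: first match only
    return None


def train(labelled):
    """Collect per-class document counts and word counts for naive Bayes."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled:
        class_docs[label] += 1
        for tok in tokenize(text):
            if tok not in COMPANIES:  # drop the name so other cues must be learned
                word_counts[label][tok] += 1
                vocab.add(tok)
    return class_docs, word_counts, vocab


def classify(text, model):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best


# Made-up news items for illustration only.
news = [
    "Apple launches new iPhone",
    "Apple reports record iPhone sales",
    "Intel unveils faster chip",
    "Intel chip sales slow",
]
labelled = [(t, weak_label(t)) for t in news if weak_label(t) is not None]
model = train(labelled)

headline = "iPhone sales at a record high"
print(weak_label(headline))       # None: no company name in the text
print(classify(headline, model))  # apple: learned from words that co-occur with the name
```

So the little matching program is useful for producing labels, while the classifier is what handles stories the matcher cannot label. How well this holds up on 3000 real documents is something only testing on held-out data would show.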