Hi all,

I am stuck deciding whether to apply classification or clustering to the
data set I have. The more I think about it, the more confused I get. Here's
what I am confronted with.

I have news documents (around 3000 and continuously increasing)
containing news about companies, investment, stocks, the economy, quarterly
income etc. My goal is to sort the news so that I know which news items
correspond to which company, e.g. for the news item "Apple launches new
iphone", I need to associate the company Apple with it. A news
item/document contains only a 'title' and a 'description', so I have to
analyze the text to find out which company the news refers to. It could
be multiple companies too.

To solve this, I turned to Mahout.

I started with clustering. I was hoping to get 'Apple', 'Google', 'Intel'
etc. as the top terms in my clusters, so that I would know the news in a
cluster corresponds to its cluster label, but things turned out differently.
I got 'investment', 'stocks', 'correspondence', 'green energy', 'terminal',
'shares', 'street', 'olympics' and lots of other terms as the top ones
(which makes sense, as clustering algorithms look for common terms). There
were some 'Apple' clusters, but very few news items were associated with
them. I thought maybe clustering is not suited to this kind of problem, as
much of the company news goes into more general clusters (investment,
profit) instead of the specific company cluster (Apple).

I then started reading about classification, which requires training data.
The name was convincing too, as I actually want to 'classify' my news items
into 'company names'. As I read on, I got the impression that the name
classification is a bit deceiving and that the technique is used more for
prediction than for classification. My other confusion is how to prepare
training data for news documents. Let's assume I have a list of companies
that I am interested in, and I write a program to produce training data for
the classifier. The program checks whether the news title or description
contains the company name 'Apple'; if so, it is a news story about Apple.
Is this how I can prepare training data? (Of course, I read that training
data is actually a set of predictors and target variables.) If so, then why
should I use Mahout classification in the first place? I could ditch Mahout
and instead use the little program that I wrote for the training data
(which actually does the classification).
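
To make the idea concrete, here is a rough sketch of the kind of labelling
program I have in mind. It is written in Python only for illustration; the
company list, the sample news items and the function name label_news are
all placeholders I made up, and the only thing assumed about the data is
that each item has a 'title' and a 'description'.

```python
import re

# Hypothetical list of companies of interest (placeholder values).
COMPANIES = ["Apple", "Google", "Intel"]

def label_news(item, companies=COMPANIES):
    """Return the companies whose name appears in the title or description."""
    text = f"{item.get('title', '')} {item.get('description', '')}"
    labels = []
    for name in companies:
        # Whole-word, case-insensitive match, so "Intel" does not
        # match inside "intelligence".
        if re.search(rf"\b{re.escape(name)}\b", text, flags=re.IGNORECASE):
            labels.append(name)
    return labels

# Made-up sample items, only to show the behaviour.
news = [
    {"title": "Apple launches new iphone",
     "description": "The phone goes on sale Friday."},
    {"title": "iphone sales at a record high",
     "description": "Analysts are surprised."},
    {"title": "Google and Intel announce chip partnership",
     "description": ""},
]

for item in news:
    print(item["title"], "->", label_news(item))
```

The first item gets the label Apple and the third gets Google and Intel,
but the second item gets no label at all because the word 'Apple' never
appears in it. A plain name match like this can only label the items that
mention the company explicitly.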

You can see how confused I am about how to address this issue. Another
thing that concerns me: is it possible to make a system intelligent enough
that, if the news says 'iphone sales at a record high' without using the
word 'Apple', it can still classify the item as news related to Apple?

Thank you in advance for pointing me in the right direction.
