Is this Crunchbase?
If your goal is to classify companies by what they *do*, then I think
you are best off ignoring most of the data (funding, employees, etc.) and
clustering their descriptions and/or the text of articles about them as
if they were documents. In that sense it is similar to 20 newsgroups, yes.
You'd have to extract the text from Crunchbase first; once you have it as
text docs, the process is the same.
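
To make "treat the descriptions as documents" concrete, here is a rough
sketch in plain Python (standard library only, not Mahout): it turns each
description into a TF-IDF vector and groups the vectors with a small
cosine-similarity k-means. The sample descriptions, the function names, and
the choice of k=2 are all made up for illustration; real text would have to
be pulled out of Crunchbase first, and at any real scale you would run the
equivalent steps in Mahout rather than this toy code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic runs only; a real pipeline would also drop stop words.
    return re.findall(r"[a-z]+", text.lower())


def normalise(vec):
    # Scale a sparse vector (term -> weight) to unit length.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    # One unit-length TF-IDF vector per document, using a smoothed IDF.
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        vectors.append(normalise(vec))
    return vectors


def cosine(a, b):
    # Dot product; equals cosine similarity because the vectors are unit length.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, iterations=10):
    # Deterministic start: the first vector, then the vector least similar
    # to the centroids chosen so far (farthest-first).
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        labels = [max(range(k), key=lambda i: cosine(v, centroids[i])) for v in vectors]
        # Recompute each centroid as the normalised sum of its members.
        for i in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == i]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[i] = normalise(total)
    return labels


# Hypothetical company descriptions, invented for this example.
descriptions = [
    "Online payments platform for merchants and banks",
    "Mobile payments and banking app for small merchants",
    "Social photo sharing network for friends",
    "Photo and video sharing app with social network features",
]
labels = kmeans(tfidf_vectors(descriptions), k=2)
print(labels)
```

With these four made-up descriptions, the two payments companies should end
up in one cluster and the two photo-sharing companies in the other. If you
already have category labels for some companies, you would train a
classifier on the same vectors instead, as in the 20 newsgroups example.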

On Tue, Jul 26, 2011 at 10:17 PM, Shrikar archak <shrika...@gmail.com> wrote:
> Hi All,
> I am new to machine learning and wanted to know more about Mahout in
> general and how we can apply these algorithms to our applications.
>
> I wanted to try out this example:
>
> Techcrunch has the company database and also information about what each
> company does.
> I was thinking we could use Mahout's classification algorithms, which could
> take these info pages and classify the companies into different categories.
>
> One more thing would be to look at their job descriptions, find out what
> technologies they are using, and classify them on that basis.
>
> What steps would be required to get this done?
> I tried out the Twenty Newsgroups example
> <https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups>,
> in which case we need to train it. I assume we need to
> do something like that for the problem described above.
> Please let me know.
>
> Thanks,
> Shrikar
>
