Is there an easy way to export this data from Sematext / Stack Overflow?
Or is web crawling/scraping the way to go here?

This is a good use case for Mahout; I've been looking for a problem to play
around with in Mahout :)


On 3/2/11 1:05 AM, "Friso van Vollenhoven" <fvanvollenho...@xebia.com>
wrote:

> You could try using Apache Mahout to at least cluster the messages into groups
> of similar ones based on text features. That should be doable. Given the
> groups, you could manually extract questions (the clusters with most threads
> could be the most frequently asked). Also, if you manage to get this to work
> nicely, it could be a nice tool for other projects as well. Would be a fun
> exercise anyways...
> 
> I am starting to toy with Mahout for another pet project. Once I get more
> comfortable with it, I might be able to take this on (not a promise).
> 
> I think automatic question extraction is quite an ambitious goal.
> 
> Friso
> 
> 
> 
> On 1 Mar 2011, at 19:12, Stack wrote:
> 
>> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
>> <otis_gospodne...@yahoo.com> wrote:
>>>> Do you have something in mind? Could we be making better use of the
>>>> Sematext summaries?
>>> 
>>> Hm... we already index HBase and other Digests on search-hadoop.com.
>>> I was thinking more along the lines of mining the ML archives and doing
>>> automatic Q&A extraction.
>>> I don't know how difficult it would be.  Maybe the input would be too noisy
>>> (people don't ask proper questions, answers are not full sentences, quote
>>> characters prefixing lines from old messages add a layer of complexity...),
>>> but that's what I thought you might have meant.
>>> 
>> 
>> That'd be a nice addition to the docs.  Our FAQ is in need of
>> updating.  This would be a nice undertaking if someone was up for
>> taking it on.
>> St.Ack
> 
> 
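
The idea discussed in this thread (strip the quoted lines, then group
messages by text similarity and look at the biggest groups first) can be
sketched roughly as below. This is only an illustration in plain Python,
not Mahout and not Mahout's API: it swaps in a hand-rolled TF-IDF weighting
and a greedy single-pass grouping by cosine similarity. The function names
and the 0.3 threshold are assumptions made up for this example.

```python
import math
import re
from collections import Counter


def strip_quoted(body):
    """Drop lines starting with '>' (text quoted from earlier messages)."""
    kept = [ln for ln in body.splitlines() if not ln.lstrip().startswith(">")]
    return "\n".join(kept)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {token: weight} dict per document."""
    token_lists = [tokenize(d) for d in docs]
    n = len(token_lists)
    df = Counter()
    for toks in token_lists:
        df.update(set(toks))  # document frequency: count each token once per doc
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        # Smoothed IDF so tokens present in every document keep a small weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def cluster(messages, threshold=0.3):
    """Greedy grouping: a message joins the first cluster whose first member
    is similar enough, otherwise it starts a new cluster.
    Returns lists of message indices, largest cluster first."""
    vectors = tfidf_vectors([strip_quoted(m) for m in messages])
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return sorted(clusters, key=len, reverse=True)
```

The largest clusters would then be the candidates to read through by hand
for FAQ entries. A real run over the archives would probably want stop-word
removal and a proper clustering algorithm (k-means or similar), which is
where Mahout would come in.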
