Is there any easy way to export this data from Sematext / Stack Overflow? Or is web crawling/scraping the way to go here?
This is a good use case for Mahout. I've been looking for a problem to play around with in Mahout :)

On 3/2/11 1:05 AM, "Friso van Vollenhoven" <fvanvollenho...@xebia.com> wrote:

> You could try using Apache Mahout to at least cluster the messages into groups
> of similar ones based on text features. That should be doable. Given the
> groups, you could manually extract questions (the clusters with most threads
> could be the most frequently asked). Also, if you manage to get this to work
> nicely, it could be a nice tool for other projects as well. Would be a fun
> exercise anyways...
>
> I am starting to toy with Mahout for another pet project. Once I get more
> comfortable with it, I might be able to take this on (not a promise).
>
> I think automatic question extraction is a quite ambitious goal.
>
> Friso
>
> On 1 Mar 2011, at 19:12, Stack wrote:
>
>> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
>> <otis_gospodne...@yahoo.com> wrote:
>>>> Do you have something in mind? Could we be making better use of the
>>>> sematext summaries?
>>>
>>> Hm... we already index HBase and other Digests on search-hadoop.com.
>>> I was thinking more along the lines of mining the ML archives and doing
>>> automatic Q&A extraction.
>>> I don't know how difficult it would be. Maybe the input would be too noisy
>>> (people don't ask proper questions, answers are not full sentences, quote
>>> characters prefixing lines from old messages add a layer of complexity...),
>>> but that's what I thought you might have meant.
>>>
>>
>> That'd be a nice addition to the docs. Our FAQ is in need of
>> updating. This would be a nice undertaking if someone was up for
>> taking it on.
>> St.Ack
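To make the clustering idea from the thread a bit more concrete, here is a rough sketch in plain Python. It is not Mahout and not anyone's actual implementation: it assumes each message is available as a plain-text string, drops quoted lines (the "quote characters prefixing lines" problem mentioned above), builds simple TF-IDF vectors, and groups messages with a greedy single-pass threshold on cosine similarity. The function names, the attribution-line pattern, and the 0.15 threshold are all made up for illustration.

```python
import math
import re
from collections import Counter


def strip_quotes(message):
    """Drop quoted lines ('>' prefix) and 'On ... wrote:' attribution lines."""
    kept = []
    for line in message.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):
            continue  # text quoted from an older message
        if re.match(r"^On .+ wrote:$", stripped):
            continue  # attribution line (assumed pattern, mail clients vary)
        kept.append(stripped)
    return " ".join(kept)


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF so terms present in every document keep a small weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_messages(messages, threshold=0.15):
    """Greedy single-pass clustering: join the first cluster whose seed
    message is similar enough, otherwise start a new cluster.
    Returns a list of clusters, each a list of message indices."""
    vectors = tfidf_vectors([strip_quotes(m) for m in messages])
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Calling `cluster_messages(list_of_message_bodies)` returns groups of message indices; the largest groups would be the candidates to read through by hand when looking for frequently asked questions. A real run over the archives would need a proper clustering algorithm and better text cleanup, which is where something like Mahout would come in.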