I have looked but can't find the postings by a student who recently posted about their FAQ extraction program. The results were pretty good in terms of precision and the extracted answers were very nice. The methods used were quite simple.
Does anybody else remember this interchange? Did it not occur here? Did I imagine it?

On Wed, Mar 2, 2011 at 12:30 PM, Andrew Look <al...@shopzilla.com> wrote:

> Is there any easy way to export this data from sematext / stack overflow?
> Or is web crawling/scraping the way to go here?
>
> This is a good use case for Mahout, I've been looking for a problem to play
> around on mahout with :)
>
> On 3/2/11 1:05 AM, "Friso van Vollenhoven" <fvanvollenho...@xebia.com> wrote:
>
> > You could try using Apache Mahout to at least cluster the messages into groups
> > of similar ones based on text features. That should be doable. Given the
> > groups, you could manually extract questions (the clusters with most threads
> > could be the most frequently asked). Also, if you manage to get this to work
> > nicely, it could be a nice tool for other projects as well. Would be a fun
> > exercise anyways...
> >
> > I am starting to toy with Mahout for another pet project. Once I get more
> > comfortable with it, I might be able to take this on (not a promise).
> >
> > I think automatic question extraction is a quite ambitious goal.
> >
> > Friso
> >
> > On 1 mrt 2011, at 19:12, Stack wrote:
> >
> >> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
> >> <otis_gospodne...@yahoo.com> wrote:
> >>>> Do you have something in mind? Could we be making better use of the
> >>>> sematext summaries?
> >>>
> >>> Hm... we already index HBase and other Digests on search-hadoop.com.
> >>> I was thinking more along the lines of mining the ML archives and doing
> >>> automatic Q&A extraction.
> >>> I don't know how difficult it would be. Maybe the input would be too noisy
> >>> (people don't ask proper questions, answers are not full sentences, quote
> >>> characters prefixing lines from old messages add a layer of complexity...),
> >>> but that's what I thought you might have meant.
> >>>
> >>
> >> That'd be a nice addition to the docs. Our FAQ is in need of
> >> updating. This would be a nice undertaking if someone was up for
> >> taking it on.
> >> St.Ack