Hi all,
Another issue came up: cleaning the text.
One interested user suggested using nCleaner (see http://www.lrec-conf.org/proceedings/lrec2008/pdf/885_paper.pdf)
as a way of discarding boilerplate text that skews term frequency data.
Any thoughts on this?
Thanks,
-- Ken
On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:
Might be of interest to all you Mahouts out there...
http://bixolabs.com/datasets/public-terabyte-dataset-project/
Would be cool to get this converted over to our vector format so
that we can cluster, etc.
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g