A very simple way to remove boilerplate (and this is trivial using MapReduce) is to just remove all duplicate sentences. This does assume you can extract sentences, i.e. do sentence boundary detection, etc.
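A minimal sketch of that as a Hadoop job (the class names and the crude regex splitter below are just illustrative; a real pipeline would use a proper sentence detector such as OpenNLP's):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SentenceDedup {

  // Mapper: naively splits each input line into sentences and emits each
  // sentence as a key. The regex split is a stand-in for real sentence
  // boundary detection.
  public static class SentenceMapper extends Mapper<Object, Text, Text, NullWritable> {
    private final Text sentence = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String s : value.toString().split("(?<=[.!?])\\s+")) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) {
          sentence.set(trimmed);
          context.write(sentence, NullWritable.get());
        }
      }
    }
  }

  // Reducer: identical sentences shuffle to the same reduce call, so
  // writing the key once keeps a single copy and drops all duplicates.
  public static class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context)
        throws IOException, InterruptedException {
      context.write(key, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sentence dedup");
    job.setJarByClass(SentenceDedup.class);
    job.setMapperClass(SentenceMapper.class);
    job.setReducerClass(DedupReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Counting the values in the reducer instead would let you drop only sentences that repeat more than some threshold, which may be closer to what you want for boilerplate.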

Miled
Sent from your iPod


On 13 Nov 2009, at 19:06, Ted Dunning <[email protected]> wrote:

This looks like a very nice approach for getting rid of the goo. I often advocate using words/phrases/ngrams that are highly predicted by the domain name as an alternative way of removing boilerplate. That has the advantage that it doesn't require training text. In the case of Wikipedia, this is not so useful because everything is in the same domain. The domain predictor trick will only work if the feature you are using for the input is not very content based. Thus, this can fail for small domain-focused sites or if you use a content-laden URL for the task.
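To make that concrete, here is a rough in-memory sketch of one way to score "predicted by the domain name" (raw lift against the corpus-wide rate, with an arbitrary threshold; the class and method names are made up for illustration, and a log-likelihood ratio test would hold up better on low counts than raw lift):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DomainPredictedNgrams {

  private final Map<String, Map<String, Integer>> byDomain =
      new HashMap<String, Map<String, Integer>>();
  private final Map<String, Integer> global = new HashMap<String, Integer>();
  private int totalCount = 0;

  // Record one n-gram occurrence observed on a page from the given domain.
  public void add(String domain, String ngram) {
    Map<String, Integer> counts = byDomain.get(domain);
    if (counts == null) {
      counts = new HashMap<String, Integer>();
      byDomain.put(domain, counts);
    }
    increment(counts, ngram);
    increment(global, ngram);
    totalCount++;
  }

  private static void increment(Map<String, Integer> m, String key) {
    Integer c = m.get(key);
    m.put(key, c == null ? 1 : c + 1);
  }

  // An n-gram is a boilerplate candidate for a domain when it is far more
  // likely inside that domain than in the corpus overall, i.e. the domain
  // name "predicts" it. minLift is an illustrative threshold.
  public Set<String> boilerplateCandidates(String domain, double minLift) {
    Set<String> flagged = new HashSet<String>();
    Map<String, Integer> counts = byDomain.get(domain);
    if (counts == null) return flagged;
    int domainTotal = 0;
    for (int c : counts.values()) domainTotal += c;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      double pInDomain = e.getValue() / (double) domainTotal;
      double pGlobal = global.get(e.getKey()) / (double) totalCount;
      if (pInDomain / pGlobal >= minLift) flagged.add(e.getKey());
    }
    return flagged;
  }
}

This fails exactly as described above: on a single-domain corpus like Wikipedia every n-gram is "predicted" by the one domain, and on a small topic-focused site the content n-grams get flagged along with the navigation goo.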



On Fri, Nov 13, 2009 at 10:36 AM, Ken Krugler <[email protected]> wrote:

Hi all,

Another issue came up, about cleaning the text.

One interested user suggested using nCleaner (see
http://www.lrec-conf.org/proceedings/lrec2008/pdf/885_paper.pdf) as a way
of tossing boilerplate text that skews text frequency data.

Any thoughts on this?

Thanks,

-- Ken


On Nov 3, 2009, at 5:43am, Grant Ingersoll wrote:

Might be of interest to all you Mahouts out there...
http://bixolabs.com/datasets/public-terabyte-dataset-project/

Would be cool to get this converted over to our vector format so that we
can cluster, etc.


--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

--
Ted Dunning, CTO
DeepDyve
