Hi David,

Interesting problem.

Just typing out loud here...

Depending on how sloppy you want the match to be, and the max # of words that 
you'd want to treat as a prefix (or suffix) of an article, one scalable 
approach (just considering the prefix) is...

1. Pick a max # of words for the prefix, let's call this N

2. Pick a max slop value, let's call this S

3. Tokenize the first N words in the article

4. Output word/position pairs, for position +/- S. E.g.

<word>, <position>
<word>, <position - 1>
<word>, <position + 1>
…

5. Calculate counts for each <word>,<position>

6. Group on <word>, sort by <position>, and merge counts for positions that are 
close enough (take the average position as the merged position)

7. Calculate document frequency for each remaining <word>, <position>

8. Filter out any results with DF less than some threshold (e.g. 0.20)

9. Save the resulting <word>,<position> values
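
To make steps 1-9 a bit more concrete, here's a rough single-machine Python 
sketch (not the map-reduce version). The names build_prefix_table, merge_gap 
and min_df are made up for illustration, and so is the regex tokenizer. Two 
assumptions worth flagging: it tracks the set of documents behind each 
<word>,<position> rather than a raw count, so the DF can't exceed 1.0 after 
the slop expansion, and "close enough" is taken to mean positions no more 
than merge_gap apart.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercased word tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_prefix_table(articles, n=50, slop=1, merge_gap=1, min_df=0.20):
    """Return {word: [(avg_position, df), ...]} for likely prefix boilerplate."""
    num_docs = len(articles)

    # Steps 3-5: emit (word, position +/- slop) for the first n tokens and
    # remember which documents produced each pair (assumption: doc sets
    # instead of raw counts, so DF stays a true fraction of documents).
    docs_at = defaultdict(set)
    for doc_id, text in enumerate(articles):
        for pos, word in enumerate(tokenize(text)[:n]):
            for p in range(pos - slop, pos + slop + 1):
                docs_at[(word, p)].add(doc_id)

    # Step 6: group on word, sort by position, merge nearby positions.
    by_word = defaultdict(list)
    for (word, pos), docs in docs_at.items():
        by_word[word].append((pos, docs))

    table = {}
    for word, entries in by_word.items():
        entries.sort(key=lambda e: e[0])
        clusters = []
        for pos, docs in entries:
            if clusters and pos - clusters[-1]["last"] <= merge_gap:
                c = clusters[-1]
                c["last"] = pos
                c["weighted"] += pos * len(docs)
                c["hits"] += len(docs)
                c["docs"] |= docs
            else:
                clusters.append({"last": pos, "weighted": pos * len(docs),
                                 "hits": len(docs), "docs": set(docs)})

        # Steps 7-8: document frequency per merged position, then filter.
        kept = []
        for c in clusters:
            df = len(c["docs"]) / num_docs
            if df >= min_df:
                kept.append((c["weighted"] / c["hits"], df))
        if kept:
            table[word] = kept  # Step 9: this is what you'd save.
    return table
```

One thing this exposes: with a small merge_gap, a very common word that shows 
up all over the first N words will chain into one wide cluster, so you'd 
probably want to cap the cluster width or drop stop words first.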

All of the above I'd do in a map-reduce job (assuming you've got a significant 
amount of data), and I'd add in an implicit grouping by domain - unless you 
need this to span domains.
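
For the grouping by domain, something like the following would do on a small 
scale. It assumes each article arrives as a (url, text) pair, which is my 
assumption about your data, and group_by_domain is a made-up name. In a real 
map-reduce job the domain would just be part of the grouping key.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_domain(records):
    """Group article texts by the host name of their source URL."""
    groups = defaultdict(list)
    for url, text in records:
        # netloc is the host part of the URL, e.g. "www.example.com"
        groups[urlparse(url).netloc.lower()].append(text)
    return dict(groups)
```

You'd then build one <word>,<position> table per domain.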

Then when you get an article, tokenize it, then walk it (word by word). If you 
find a matching word with a position that's close enough, mask it, otherwise 
skip the word.

When you get more than some count of skipped words in a row, or a total count, 
then you're done.
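
Again as a rough Python sketch, the walk might look like this. It assumes the 
saved table has been loaded as {word: [position, ...]}, and the names 
mask_prefix, max_run and max_total (plus their default values) are just 
placeholders I picked.

```python
import re


def tokenize(text):
    """Lowercased word tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def mask_prefix(text, table, slop=1, max_run=3, max_total=10):
    """Return (tokens, masked); masked[i] is True for boilerplate tokens.

    table maps word -> list of expected positions (hypothetical format).
    """
    tokens = tokenize(text)
    masked = [False] * len(tokens)
    run = 0    # skipped words in a row
    total = 0  # skipped words overall
    for pos, word in enumerate(tokens):
        if any(abs(pos - p) <= slop for p in table.get(word, ())):
            masked[pos] = True
            run = 0
        else:
            run += 1
            total += 1
            # Too many non-matching words: assume the real article started.
            if run > max_run or total > max_total:
                break
    return tokens, masked
```

The clean text is then whatever tokens aren't masked. For a suffix you'd do 
the same thing walking backwards from the end, with positions counted from 
the last word.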

-- Ken

On Jun 3, 2014, at 12:16am, David Noel <david.i.n...@gmail.com> wrote:

> I'm clustering a pretty typical use case (news articles), but I keep
> running into a problem that ends up ruining the final cluster quality:
> noise, or "junk" sentences appended or prepended to the articles by
> the news outlet. Removing common noise from datasets is a problem
> common to many domains (news, bioinformatics, etc) so I figure there
> must be some solution to it in existence already. Does anyone know of
> any libraries to clean common strings from a set of strings (Java,
> preferably)?
> 
> I'm scraping pages from news outlets using HTMLUnit and passing the
> output to Boilerpipe to extract the article contents. I've noticed
> that Boilerpipe doesn't always do that great of a job. Often noise
> will slip through and when I cluster the data the results are skewed
> because of it.
> 
> Examples of common "junk" sentences are as follows:
> 
> -“Get Connected! MASNsports.com is your online home for the latest
> Orioles and Nationals news, features, and commentary. And now, you can
> connect with MASN on every digital level. From web and social media to
> our new mobile alert service, MASN has got all the bases covered. Get
> social!”
> 
> -“Home KKTV firmly believes in freedom of speech for all and we are
> happy to provide this forum for the community to share opinions and
> facts. We ask that commenters keep it clean, keep it truthful, stay on
> topic and be responsible. Comments left here do not necessarily
> represent the viewpoint of KKTV 11 News. If you believe that any of
> the comments on our site are inappropriate or offensive, please tell
> us by clicking “Report Abuse” and answering the questions that follow.
> We will review any reported comments promptly.”
> 
> -“(TM and © Copyright 2014 CBS Radio Inc. and its relevant
> subsidiaries. CBS RADIO and EYE Logo TM and Copyright 2014 CBS
> Broadcasting Inc. Used under license. All Rights Reserved. This
> material may not be published, broadcast, rewritten, or redistributed.
> The Associated Press contributed to this report.)”
> 
> -“(© Copyright 2014 The Associated Press. All Rights Reserved. This
> material may not be published, broadcast, rewritten or
> redistributed.)”
> 
> ..and on.
> 
> I've played around with a number of different methods to clean the
> dataset prior to clustering: manually gathering and scrubbing common
> substrings, using various LCS implementations (Longest Common
> Subsequence), computing the Levenshtein distance for all possible
> substrings, and on, but I've put a significant amount of time into
> them and haven't had the greatest results. So I figure I'd ask if
> anyone knows of any library that does something along the lines of
> what I'm trying to do. Has anyone had any luck finding such a thing?
> 
> Many thanks,
> 
> -David

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




