[CTRL] Researchers 'text mine' The New York Times, demonstrating ease of new technology

RoadsEnd Sun, 30 Jul 2006 20:01:39 -0700

-Caveat Lector-

http://www.physorg.com/news73141363.html

Researchers 'text mine' The New York Times, demonstrating ease of new technology




Performing what a team of dedicated and bleary-eyed newspaper librarians would need months to do, scientists at UC Irvine have used an up-and-coming technology to complete in hours a complex topic analysis of 330,000 stories published primarily by The New York Times.

The demonstration is significant because it is one of the earliest showing that an extremely efficient, yet very complicated, technology called text mining is on the brink of becoming a tool useful to more than highly trained computer programmers and homeland security experts.

"We have shown in a very practical way how a new text mining technique makes understanding huge volumes of text quicker and easier," said David Newman, a computer scientist in the Donald Bren School of Information and Computer Sciences at UCI. "To put it simply, text mining has made an evolutionary jump. In just a few short years, it could become a common and useful tool for everyone from medical doctors to advertisers; publishers to politicians."

Text mining allows a computer to extract useful information from unstructured text. Until recently, text mining required a great deal of preparation before documents could be analyzed in a meaningful way. A new text-mining technique called "topic modeling" -- which UCI scientists used in their New York Times experiment -- looks for patterns of words that tend to occur together in documents, then automatically categorizes those words into topics -- all with minimal human effort.

UCI researchers didn't invent topic modeling, but they developed a technique that allows the technology to be used on huge document collections. They also are among the first to demonstrate its ease and effectiveness by applying it to a newspaper archive. The results reveal few surprises, but the application demonstrates the ability of topic modeling to spot trends and make connections in a way that could be applied to more complicated and cumbersome documents such as those used by medical researchers and lawyers.

Newman and UCI researchers Padhraic Smyth, Mark Steyvers and Chaitanya Chemudugunta presented their research at the recent Intelligence and Security Informatics conference in San Diego.

The topic model, applied to the collection of news articles published from 2000 to 2002, identified patterns of words that occurred together in the stories. From those words, researchers were able to identify topics. Information associated with those topics was charted over time, allowing the scientists to pinpoint what months of the year certain topics were most in the news and how much ink they received from year to year.

"If I were interested in advertising a product related to the Tour de France, I might want to know whether interest in the Tour de France is increasing or decreasing," Newman said. "This might be very important knowledge."
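
The month-by-month charting described above can be illustrated with a short Python sketch. This is not the UCI team's code: the function name, the input layout (a publication date paired with per-topic weights for each article) and the topic labels used in the example are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def topic_share_by_month(articles):
    """Aggregate per-article topic weights into monthly topic shares.

    articles: iterable of (publication_date, {topic_label: weight}) pairs.
    This input layout is hypothetical; a real topic model would supply
    the per-article weights.
    Returns {(year, month): {topic_label: share of that month's weight}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for pub_date, weights in articles:
        key = (pub_date.year, pub_date.month)
        for topic, weight in weights.items():
            totals[key][topic] += weight

    shares = {}
    for key, topic_weights in totals.items():
        month_total = sum(topic_weights.values())
        if month_total == 0:
            shares[key] = {}
            continue
        # Normalize so each month's topic shares sum to 1.
        shares[key] = {t: w / month_total for t, w in topic_weights.items()}
    return shares


if __name__ == "__main__":
    # Made-up example data, not figures from the study.
    sample = [
        (date(2001, 7, 10), {"cycling": 1.0}),
        (date(2001, 7, 20), {"cycling": 0.5, "housing": 0.5}),
        (date(2001, 8, 1), {"housing": 1.0}),
    ]
    for month, topic_shares in sorted(topic_share_by_month(sample).items()):
        print(month, topic_shares)
```

Comparing a topic's share across successive months or years would show whether coverage of it is rising or falling, which is the kind of question Newman raises about the Tour de France.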

Including the Tour de France, the model automatically identified a total of 400 topics ranging from renting apartments in Brooklyn and diving in Hawaii to voting irregularities and dinosaur bones. As for newsmakers, topics included Tiger Woods, Elian Gonzalez, Denzel Washington and Barbie.

"Text mining is an incredible tool," Newman said. "It already allows a doctor to identify the common thread in old and new medical research. With topic modeling, connections can be drawn faster and more efficiently in large volumes of text."

About topic modeling: UCI researchers performed their experiment using a statistical topic model based on a text model developed at UC Berkeley in 2003. Thanks to an improved solution technique proposed by Mark Steyvers and a research partner, this model has advanced from academic use to something that is now widely used in the research community. Topic modeling looks for patterns of words that tend to occur together in documents, then automatically categorizes those words into topics. Older text-mining techniques require the user to come up with an appropriate set of topic categories and manually find hundreds to thousands of example documents for each category. This human-intensive process is called supervised learning. In contrast, topic modeling, a type of unsupervised learning, doesn't need suggestions for an appropriate set of topic categories or human-found example documents. This makes retrieving information easier and quicker.
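
To make the idea of unsupervised topic discovery concrete, here is a minimal Python sketch of one common way to fit such a model: a collapsed Gibbs sampler for a topic model in the style of latent Dirichlet allocation. The article does not name the exact model or solution technique the researchers used, so treating it as this one is an assumption, and the function names, parameter values (`alpha`, `beta`, `iterations`) and toy documents are all hypothetical. It is a toy for a handful of short documents, not something that would scale to 330,000 stories.

```python
import random
from collections import Counter


def fit_topics(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for an LDA-style topic model.

    docs: list of token lists. No topic labels or example documents
    are supplied; topics emerge from word co-occurrence alone.
    Returns (topic_word, doc_topic):
      topic_word[k][w] = times word w is assigned to topic k
      doc_topic[d][k]  = tokens in document d assigned to topic k
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in docs]
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            topic_word[z][w] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                # Resample: favor topics already common in this document
                # and topics that already contain this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                topic_word[z][w] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1
    return topic_word, doc_topic


def top_words(topic_word, n=3):
    """Return the n most frequently assigned words for each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0]
            for counts in topic_word]


if __name__ == "__main__":
    # Made-up toy documents, not data from the study.
    toy_docs = [
        "tour france cycling stage tour".split(),
        "cycling stage race france".split(),
        "rent apartment brooklyn lease".split(),
        "brooklyn apartment rent broker".split(),
    ]
    word_counts, _ = fit_topics(toy_docs, num_topics=2)
    for k, words in enumerate(top_words(word_counts)):
        print("topic", k, words)
```

The sampler is random, so the word groupings can differ between seeds and the topics come out unlabeled; a person still has to read the top words and decide what each topic is about.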

Source: University of California - Irvine
