Equally important in the long run are the databases that will be created by the nearly spontaneous aggregation of scores or hundreds of smaller databases. “What seem to be small-scale, discrete systems end up being combined into large databases,” says Marc Rotenberg, executive director of the Electronic Privacy Information Center, a nonprofit research organization in Washington, DC. He points to the recent, voluntary efforts of merchants in Washington’s affluent Georgetown district. They are integrating their in-store closed-circuit television networks and making the combined results available to city police. In Rotenberg’s view, the collection and consolidation of individual surveillance networks into big government and industry programs “is a strange mix of public and private, and it’s not something that the legal system has encountered much before.”
Managing the sheer size of these aggregate surveillance databases, surprisingly, will not pose insurmountable technical difficulties. Most personal data are either very compact or easily compressible. Financial, medical, and shopping records can be represented as strings of text that are easily stored and transmitted; as a general rule, the records do not grow substantially over time.


Even biometric records are no strain on computing systems. To identify people, genetic-testing firms typically need stretches of DNA that can be represented in just one kilobyte—the size of a short e-mail message. Fingerprints, iris scans, and other types of biometric data consume little more. Other forms of data can be preprocessed in much the way that the cameras on Route 9 transform multimegabyte images of cars into short strings of text with license plate numbers and times. (For investigators, having a video of suspects driving down a road usually is not as important as simply knowing that they were there at a given time.) To create a digital dossier for every individual in the United States—as programs like Total Information Awareness would require—only “a couple terabytes of well-defined information” would be needed, says Jeffrey Ullman, a former Stanford University database researcher. “I don’t think that’s really stressing the capacity of [even today’s] databases.”
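
Ullman’s estimate is easy to sanity-check with back-of-envelope arithmetic. In the Python sketch below, only the one-kilobyte DNA figure comes from the reporting above; the population and the other per-record sizes are assumptions chosen for illustration:

```python
# Back-of-envelope check of Ullman's "couple terabytes" estimate.
# Only the ~1 KB DNA figure comes from the text; the population
# and the other per-record sizes are illustrative assumptions.

US_POPULATION = 290_000_000  # assumed: rough 2003 U.S. population

BYTES_PER_PERSON = {
    "DNA identification profile": 1_024,                  # ~1 KB, per the article
    "fingerprints, iris scans, other biometrics": 4_096,  # assumed
    "financial, medical, shopping records": 2_048,        # assumed
}

total_tb = US_POPULATION * sum(BYTES_PER_PERSON.values()) / 1e12
print(f"{total_tb:.1f} TB for a dossier on every U.S. resident")  # ~2.1 TB
```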

Instead, argues Rajeev Motwani, another member of Stanford’s database group, the real challenge for large surveillance databases will be the seemingly simple task of gathering valid data. Computer scientists use the term GIGO—garbage in, garbage out—to describe situations in which erroneous input creates erroneous output. Whether people are building bombs or buying bagels, governments and corporations try to predict their behavior by integrating data from sources as disparate as electronic toll-collection sensors, library records, restaurant credit-card receipts, and grocery store customer cards—to say nothing of the Internet, surely the world’s largest repository of personal information. Unfortunately, all these sources are full of errors, as are financial and medical records. Names are misspelled and digits transposed; address and e-mail records become outdated when people move and switch Internet service providers; and formatting differences among databases cause information loss and distortion when they are merged. “It is routine to find in large customer databases defective records—records with at least one major error or omission—at rates of at least 20 to 35 percent,” says Larry English of Information Impact, a database consulting company in Brentwood, TN.

Unfortunately, says Motwani, “data cleaning is a major open problem in the research community. We are still struggling to get a formal technical definition of the problem.” Even when the original data are correct, he argues, merging them can introduce errors where none had existed before. Worse, none of these worries about the garbage going into the system even begin to address the still larger problems with the garbage going out.
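
A toy example shows why merging is so treacherous. In the sketch below (every name and threshold is invented), an exact join silently drops a record because of a misspelled name, while a fuzzy join recovers it but wrongly merges two different people, an error that existed in neither source:

```python
# Toy illustration of the record-linkage problem: exact matching
# loses records to misspellings, while fuzzy matching can merge
# records for two different people. All data here are invented.
from difflib import SequenceMatcher

db_a = ["Jon Smith", "Ann Lee"]
db_b = ["John Smith", "Anne Leigh"]  # same person; different person

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Exact join: "Jon Smith" != "John Smith", so the true match is lost.
print([(a, b) for a in db_a for b in db_b if a == b])  # []

# Fuzzy join: recovers Jon/John but also pairs Ann Lee with Anne
# Leigh, an error introduced by the merge itself.
print([(a, b, round(similarity(a, b), 2))
       for a in db_a for b in db_b if similarity(a, b) > 0.7])
```
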
[Map: People passing through Manhattan’s Times Square area leave a trail of images on scores of webcams and private and city-owned surveillance cameras. New York privacy activist Bill Brown compiled this map in September 2002.]
The Dissolution of Privacy


Almost every computer-science student takes a course in algorithms. Algorithms are sets of specified, repeatable rules or procedures for accomplishing tasks such as sorting numbers; they are, so to speak, the engines that make programs run. Unfortunately, innovations in algorithms are not subject to Moore’s law, and progress in the field is notoriously sporadic. “There are certain areas in algorithms we basically can’t do better and others where creative work will have to be done,” Ullman says. Sifting through large surveillance databases for information, he says, will essentially be “a problem in research in algorithms. We need to exploit some of the stuff that’s been done in the data-mining community recently and do it much, much better.”

Working with databases requires users to have two mental models. One is a model of the data. Teasing out answers to questions from the popular search engine Google, for example, is easier if users grasp the varieties and types of data on the Internet—Web pages with words and pictures, whole documents in a multiplicity of formats, downloadable software and media files—and how they are stored. In exactly the same way, extracting information from surveillance databases will depend on a user’s knowledge of the system. “It’s a chess game,” Ullman says. “An unusually smart analyst will get things that a not-so-smart one will not.”

Second, and more important in Spafford’s view, effective use of big surveillance databases will depend on having a model of what one is looking for. This factor is especially crucial, he says, when trying to predict the future, a goal of many commercial and government projects. For this reason, what might be called reactive searches, which scan recorded data for specific patterns, are generally far more likely to obtain useful answers than proactive searches, which try to anticipate events that have not yet occurred. If, for instance, police in the Washington sniper investigation had been able to tap into a pervasive network of surveillance cameras, they could have tracked people seen near the crime scenes until they could be stopped and questioned: a reactive process. But it is unlikely that police would have been helped by proactively asking surveillance databases for the names of people in the Washington area with the requisite characteristics (family difficulties, perhaps, or military training and a recent penchant for drinking) to become snipers.
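
A minimal sketch suggests what such a reactive query might look like, using compact plate-sighting records like those the Route 9 cameras produce; every plate, location, and time here is invented:

```python
# A sketch of a "reactive" search: given compact plate sightings
# like those the Route 9 cameras distill from multimegabyte images,
# list plates recorded near more than one crime scene. Data invented.
from collections import defaultdict
from datetime import datetime

# (plate, camera location, time of sighting)
sightings = [
    ("XYZ123", "scene 1", datetime(2002, 10, 3, 8, 12)),
    ("ABC777", "scene 1", datetime(2002, 10, 3, 8, 14)),
    ("XYZ123", "scene 2", datetime(2002, 10, 11, 9, 30)),
    ("QRS456", "scene 2", datetime(2002, 10, 11, 9, 41)),
]

scenes_seen = defaultdict(set)
for plate, location, _ in sightings:
    scenes_seen[plate].add(location)

# Plates near more than one scene become leads to stop and question.
print([p for p, scenes in scenes_seen.items() if len(scenes) > 1])  # ['XYZ123']
```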

In many cases, invalid answers are harmless. If Victoria’s Secret mistakenly mails 1 percent of its spring catalogs to people with no interest in lingerie, the price paid by all parties is small. But if a national terrorist-tracking system has the same 1 percent error rate, it will produce millions of false alarms, wasting huge amounts of investigators’ time and, worse, labeling many innocent U.S. citizens as suspects. “A 99 percent hit rate is great for advertising,” Spafford says, “but terrible for spotting terrorism.”
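
The arithmetic behind that warning is worth spelling out. In the sketch below, the 1 percent error rate comes from the example above; the population figure and the number of genuine threats are assumptions chosen only to show the scale:

```python
# Base-rate arithmetic behind Spafford's point. The 1 percent error
# rate is from the text; population and threat counts are assumed.
population = 290_000_000  # assumed: rough 2003 U.S. population
real_threats = 1_000      # assumed: a vanishingly small base rate
error_rate = 0.01         # the "1 percent error rate"

false_alarms = (population - real_threats) * error_rate
true_hits = real_threats * (1 - error_rate)  # the "99 percent hit rate"

print(f"innocents flagged: {false_alarms:,.0f}")  # ~2.9 million
print(f"real threats caught: {true_hits:,.0f}")   # 990
# Share of flagged people who are actually threats:
print(f"precision: {true_hits / (true_hits + false_alarms):.3%}")  # ~0.034%
```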

Because no system can be 100 percent accurate, analysts can only try to decrease the likelihood that surveillance databases will identify blameless people as possible terrorists. By making the criteria for flagging suspects more stringent, officials can raise the bar, and fewer ordinary citizens will be wrongly fingered. Inevitably, however, that will also mean that “borderline” terrorists—those who don’t match all the search criteria but still have lethal intentions—might be overlooked. For both types of error, the potential consequences are alarming.
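
A toy model makes the tradeoff concrete. The “suspicion scores” below are drawn from invented distributions; the point is the shape of the tradeoff, not the particular numbers:

```python
# Toy model of the threshold tradeoff: stricter criteria flag fewer
# innocents but miss more borderline threats. Scores are invented.
import random

random.seed(0)
innocents = [random.gauss(0.0, 1.0) for _ in range(100_000)]
threats = [random.gauss(2.0, 1.0) for _ in range(100)]  # overlapping scores

for threshold in (1.0, 2.0, 3.0):
    flagged = sum(s >= threshold for s in innocents)
    missed = sum(s < threshold for s in threats)
    print(f"threshold {threshold:.1f}: "
          f"{flagged:6,d} innocents flagged, {missed:3d}/100 threats missed")
```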

Yet none of these concerns will stop the growth of surveillance, says Ben Shneiderman, a computer scientist at the University of Maryland. Its potential benefits are simply too large. An example is what Shneiderman, in his recent book Leonardo’s Laptop: Human Needs and the New Computing Technologies, calls the World Wide Med: a global, unified database that makes every patient’s complete medical history instantly available to doctors through the Internet, replacing today’s scattered sheaves of paper records (see “Paperless Medicine”). “The idea,” he says, “is that if you’re brought to an ER anywhere in the world, your medical records pop up in 30 seconds.” Similar programs are already coming into existence. Backed by the Centers for Disease Control and Prevention, a team based at Harvard Medical School is planning to monitor the records of 20 million walk-in hospital patients throughout the United States for clusters of symptoms associated with bioterror agents. Given the huge number of lost or confused medical records, the benefits of such plans are clear. But because doctors would be continually adding information to medical histories, the system would be monitoring patients’ most intimate personal data. The network, therefore, threatens to violate patient confidentiality on a global scale.
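
One can imagine, in rough outline, how such cluster monitoring might work. The sketch below is hypothetical; the symptom categories, baselines, and alert rule are invented rather than drawn from the Harvard project:

```python
# Hypothetical sketch of symptom-cluster monitoring: count walk-in
# visits per region for watched symptoms and flag counts far above
# a historical baseline. All categories and thresholds are invented.
from collections import Counter

WATCHED = {"fever+rash", "flaccid paralysis"}             # assumed categories
BASELINE = {"fever+rash": 2.0, "flaccid paralysis": 0.5}  # mean daily counts

# One day of anonymized (region, symptom) pairs from walk-in records
visits = [("Boston", "fever+rash")] * 9 + [("Boston", "sprained ankle")] * 40

counts = Counter(v for v in visits if v[1] in WATCHED)
for (region, symptom), n in counts.items():
    if n > 3 * BASELINE[symptom]:  # crude rule; real systems model seasonality
        print(f"ALERT {region}: {n} cases of {symptom} "
              f"(baseline {BASELINE[symptom]})")
```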

In Shneiderman’s view, such tradeoffs are inherent to surveillance. The collective by-product of thousands of unexceptionable, even praiseworthy efforts to gather data could be something nobody wants: the demise of privacy. “These networks are growing much faster than people realize,” he says. “We need to pay attention to what we’re doing right now.”

In The Conversation, surveillance expert Harry Caul is forced to confront the tradeoffs of his profession directly. The conversation in Union Square provides information that he uses to try to stop a murder. Unfortunately, his faulty interpretation of its meaning prevents him from averting tragedy. Worse still, we see in scene after scene that even the expert snoop is unable to avoid being monitored and recorded. At the movie’s intense, almost wordless climax, Caul rips his home apart in a futile effort to find the electronic bugs that are hounding him.

The Conversation foreshadowed a view now taken by many experts: surveillance cannot be stopped. There is no possibility of “opting out.” The question instead is how to use technology, policy, and shared societal values to guide the spread of surveillance—by the government, by corporations, and perhaps most of all by our own unwitting and enthusiastic participation—while limiting its downside.

--------------------------------------------------------------------------------
Next month: how surveillance technology is changing our definition of privacy—and why the keys to preserving it may be in the technology itself.


Dan Farmer is a software engineer and computer security expert.
Charles C. Mann has written for Technology Review about the free software movement (January/February 1999) and the use of genetic engineering in agriculture (July/August 1999).

