Tom,

Thanks for the feedback:

[TM] > You could create an ontology as it "should" be or you
> can use an ontology which matches the practices and conventions used
> by the Wikipedia editors. The latter is going to be messy in many
> ways, but at least it'll have a large quantity of data to work with.
The way an ontology *should be* is the way it will be most useful to those who intend to use it. That means it should be comprehensible and acceptable to them. As languages go (an ontology serves as a logical language), that also means it will be the sum of the inputs of those who use it, not something imposed by some external authority.

The question I have not been able to resolve in my brief look at the DBpedia site is just how the ontology is anticipated to be used. Is there an application that uses it? The application that uses it will be the ultimate arbiter of how it "should be". I would much appreciate a reference to actual uses in applications, where I can see how it is used and whether additional precision may be useful.

I am aware of how wary people (including myself) are of those who would impose some ontology or terminology on a community. There is a long history of such efforts. The common resistance to using a complex system devised by others (when something simpler seems to serve as well) is one of the reasons that CYC has not been more widely adopted. In general, a big reason for the lack of wide adoption of CYC (and other "upper ontologies") is that people will only make the effort to use another system if they have examples of uses that convince them it is worth the effort; but all significant uses of CYC and SUMO are proprietary, and details are not available to the public. There are also many examples where people *do* make the effort to learn a system devised elsewhere, including linguistic systems, when useful applications can be seen. It is common even among ontologists to say that people will prefer to use their own languages, databases, terminologies, and ontologies, so that no one language, ontology, or database will ever be adopted universally; but we have a fine example of just such adoption of a common language: English.
If you go to an international conference, virtually everyone speaks English and presents in English if they want their contributions to be understood by the largest number of people; the motivation is sufficient for people to make the effort to learn the language. And that is where an ontology can serve any community, or the whole world: in any situation where the creators of knowledge want to share it, in a precise form suitable for automated reasoning, with the whole community, however large or small that community is.

As I understand it, the community intended to be served by DBpedia is the whole world. That is very ambitious, but I feel certain from my own work that it is entirely feasible to create an ontology suitable for that whole world community. It does take more effort than just automatically extracting triples from a data source, structured or unstructured. Such an ontology cannot be imposed from above; it has to grow from the needs and practices of the community that uses it. But it will benefit from the large amount of work already done building other ontologies. Much of the hard work has already been done.

The problem with using extracted data triples *alone* as a representation of knowledge is that, except in carefully controlled systems, they have the same problems as natural language itself: the same term may be used with multiple meanings (ambiguity, or polysemy), or many terms may be used with the same meaning (synonymy). Using OWL is a good step, but OWL is only a simple *grammar* for representing knowledge. Communication requires a common *vocabulary* as well as a common grammar. Triple stores created without prior agreement on terminology may still be useful for some probabilistic reasoning purposes. Automated alignment of data from different sources mostly relies on string matching to identify terms that are likely to have the same meanings in different data stores.
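To make the ambiguity problem concrete, here is a small hypothetical sketch (not DBpedia code; the terms and "stores" are invented for illustration) of why purely string-based alignment can pair up terms whose intended meanings differ:

```python
# Hypothetical example: two local "triple stores" reduced to term -> intended
# meaning. String matching aligns identical labels regardless of meaning.

store_a = {
    "bank": "financial institution",
    "title": "name of a creative work",
}
store_b = {
    "bank": "sloping edge of a river",
    "title": "legal ownership of property",
}

def align_by_string(a, b):
    """Pair terms whose labels match exactly, ignoring intended meaning."""
    return [(term, a[term], b[term]) for term in a if term in b]

for term, sense_a, sense_b in align_by_string(store_a, store_b):
    status = "OK" if sense_a == sense_b else "FALSE MATCH"
    print(f"{term}: {sense_a!r} vs {sense_b!r} -> {status}")
```

Both "bank" and "title" align by label, yet neither pair shares a meaning; a shared defining vocabulary is what lets the alignment be checked against meanings rather than strings.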
Reasoning with such databases can generate inferences that rank results by probabilities, and these can be sent to a human interpreter who makes the final decision as to whether the inferences are meaningful or nonsensical (as in a Google search). The automated alignment methods I have seen (except in very narrowly constrained domains) tend to have no more than 60% accuracy for any one pair of sources. Automated reasoning will have chains of inferences, and any chain more than one inference in length will likely result in a conclusion that is unlikely to be true; the longer the chain, the less likely an accurate result. So if automated inferencing on the data is considered desirable, very high accuracy in the representation is necessary. The good news is that such high accuracy is in fact *practical* (not merely possible), if the proper approach is used.

Although different groups and communities will insist on using their own local terminology, accurate alignment among all groups is still possible if each local community translates its own data into the common language for use by others, who will then be able to use it even if they have no idea who created the information, or for what purpose. Triple stores created by a local group may be precise if the vocabulary is carefully controlled by common agreement. For larger communities, such as the one served by Wikipedia, there is little chance of gaining agreement on a single common terminology **for all terms**. The latter qualification is crucial. What is actually needed is not wide agreement on a massive terminology of hundreds of thousands of terms, but only agreement on a basic **defining vocabulary** of a few thousand terms that is sufficient to describe accurately any specialized concept one would want to define. Learning to use such a terminology (or an associated ontology) will be comparable in effort to developing a working knowledge of a second language.
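The chain-length point above is simple compounding arithmetic. Assuming (for illustration) that each inference or alignment step is independently correct with probability p, a chain of n steps is correct with probability p**n:

```python
# Illustrative arithmetic only: compounding of per-step accuracy over an
# inference chain, using the ~60% per-pair figure mentioned above and
# assuming (simplistically) that steps are independent.

def chain_accuracy(p: float, n: int) -> float:
    """Probability that an n-step chain is correct, at per-step accuracy p."""
    return p ** n

p = 0.60
for n in range(1, 6):
    print(f"chain of {n} step(s): {chain_accuracy(p, n):.3f}")
# 0.600, 0.360, 0.216, 0.130, 0.078
```

Even a two-step chain at 60% per step is already correct barely a third of the time, which is why high accuracy in the representation itself matters so much for multi-step automated reasoning.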
In effect, in any given community that generates data, it is necessary to have at least one person who is "bilingual" in the local terminology and in the common ontology. This is perfectly feasible, if one has the motivation. I have been concerned with this tactic for database interoperability for a number of years. A discussion of the issue is given in a recent paper:

Obrst, Leo; Pat Cassidy. 2011. The Need for Ontologies: Bridging the Barriers of Terminology and Data Structures. Chapter 10 (pp. 99-124) in: Societal Challenges and Geoinformatics, Sinha, A. Krishna, David Arctur, Ian Jackson, and Linda Gundersen, eds. Memoir volume of the Geological Society of America (GSA). Available at: http://micra.com/papers/OntologiesForInteroperability.pdf

If there is any prospect or hope that the formalization of knowledge envisioned by the DBpedia project will ultimately be used for automated reasoning, it is important that effort be made at an early stage to ensure there is a proper foundation for accurate representation and the avoidance of ambiguity. I can help with this task, and will be happy to do so if others in the community are willing to do the kind of careful work required.

If, on the other hand, it is expected that only probabilistic information will be extracted from queries on the DBpedia database, suitable only for inspection by potential human users, then such care in formalization may not be required. But it would still be helpful, and wouldn't add a lot of work to what is being done. The main effort is in carefully specifying the meanings of the relations being used, to avoid ambiguity and duplication.

[TM] > Another way to approach this would be the MCC/CYC approach. It'll
> take billions of dollars and you'll need to wait many decades for them
> to finish, but at the end of it all I'm sure you'd have a perfectly
> consistent knowledge base.
The great advantage of a volunteer community is that it doesn't take a lot of time to get funding, and the expense is mostly borne by the volunteers for their own interests and their own views of what may help the public. No funder can impose a set of requirements. We *can* have a perfectly consistent database, and the effort of getting agreement on the **basic vocabulary** is likely to be a great deal less than is commonly supposed, because that vocabulary is not very large.

The work done on DBpedia thus far appears to me to be a good start. How to proceed from here depends on the ultimate goals. I am very interested in learning how this community views its future.

Pat

Patrick Cassidy
MICRA Inc.
cass...@micra.com
908-561-3416

-----Original Message-----
From: Tom Morris [mailto:tfmor...@gmail.com]
Sent: Tuesday, December 27, 2011 12:24 PM
To: Patrick Cassidy
Cc: dbpedia-discussion@lists.sourceforge.net
Subject: Re: [Dbpedia-discussion] DBpedia ontology

On Mon, Dec 26, 2011 at 7:26 PM, Patrick Cassidy <p...@micra.com> wrote:
> I have looked briefly at the DBpedia ontology and it appears to leave
> a great deal to be desired in terms of what an ontology is best suited
> for: to carefully and precisely define the meanings of terms so that
> they can be automatically reasoned with by a computer, to accomplish
> useful tasks. I will be willing to spend some time to reorganize the
> ontology to make it more logically coherent, if (1) there are any
> others who are interested in making the ontology more sound and (2) if
> there is a process by which that can be done without a very long
> drawn-out debate.
>
> I think that the general notion of formalizing the content of
> Wikipedia is a great idea, but to be useful it has to be done
> carefully.
> It is very easy, even for those with experience, to put
> logically inconsistent assertions into an ontology, and even easier to
> put in elements that are so underspecified that they are ambiguous to
> the point of being essentially useless for automated reasoning. The
> OWL reasoner can catch some things, but it is very limited, and unless
> a first-order reasoner is used one needs to be exceedingly careful
> about how one defines the relations.

You could create an ontology as it "should" be or you can use an ontology which matches the practices and conventions used by the Wikipedia editors. The latter is going to be messy in many ways, but at least it'll have a large quantity of data to work with. Getting any use out of the former would require you convincing all Wikipedians to adhere to your strict conventions, which seems unlikely to me.

Another way to approach this would be the MCC/CYC approach. It'll take billions of dollars and you'll need to wait many decades for them to finish, but at the end of it all I'm sure you'd have a perfectly consistent knowledge base.

Tom

_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion