Re: Advancing the DBpedia ontology
Sure, the data and the ontology have to line up. However, just because all the windmills in Wikipedia happen to be buildings doesn't mean that windmill should be a subcategory of building in DBpedia. Similarly, if the DBpedia class Church is a subcategory of buildings, then there is pressure to consider a church to be a building. Some of this is just (the pejorative sense of) semantics. What is wrong with defining Church to be a building that is also a place of Christian worship? That's why I suggested that DBpedia classes be tied to Wikipedia articles. (Wikipedia does identify churches with buildings, but at least using this informal definition of a church would let DBpedia contributors know what a DBpedia church should be.)

peter

On 02/24/2015 08:15 PM, Vladimir Alexiev wrote:
> From: Peter F. Patel-Schneider [mailto:pfpschnei...@gmail.com]
>> I agree that there are problems with the mappings. However, how can the
>> mappings be fixed without fixing the ontology?
>
> I could ask you a converse question: how can you make an accurate ontology
> without looking at the data? And to look at the data, you need mappings (if
> not to execute, then to document what you've examined).
>
> But more constructively: there is a large number of mapping problems
> independent of the ontology. E.g. when a Singer (Person) is mapped to Band
> (Organisation) due to a wrong check of the background field, I don't care
> how the classes are organized; it already hurts that the direct type is
> wrong.
>
> Of course, having a good ontology would help! E.g.
> https://github.com/dbpedia/mappings-tracker/issues/49: some guy named Admin
> made two props in 2010, occupation and personFunction, with nearly
> identical role history.
> - No documentation, of course.
> - occupation has 100-250 uses; personFunction has 20-50 uses.
> - Which of the two should be used?
> - More importantly, which have already been used correctly, and which are
>   wrong?
>
> I suspect that most uses of occupation are as a DataProp, even though it's
> declared as an ObjectProp. DBpedia adopts an Object/DataProp dichotomy that
> IMHO does not work well. See
> http://vladimiralexiev.github.io/pres/20150209-dbpedia/dbpedia-problems-long.html#sec-3-2
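The Object/DataProp dichotomy problem can be made concrete with a small sketch. The function names (`split_field`, `classify_field`) and the regex are hypothetical illustrations, not part of the DBpedia Extraction Framework: they show how a single infobox field value can contain both wiki-links (object values) and plain text (literal values), which a property declared as only owl:ObjectProperty or only owl:DatatypeProperty cannot faithfully capture.

```python
import re

# Matches [[Target]] and [[Target|label]] wiki-links in a raw infobox value.
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def split_field(value):
    """Split a raw infobox field value into link targets and leftover text."""
    links = LINK_RE.findall(value)
    text = LINK_RE.sub("", value).strip(" ,;")
    return {"links": links, "text": text}

def classify_field(value):
    """Classify a field value as object-like, data-like, or mixed."""
    parts = split_field(value)
    if parts["links"] and parts["text"]:
        return "mixed"   # neither owl:ObjectProperty nor owl:DatatypeProperty fits alone
    return "object" if parts["links"] else "data"

print(classify_field("[[Queen Victoria]] of [[England]]"))  # mixed
print(classify_field("[[Queen Victoria]]"))                 # object
print(classify_field("singer, songwriter"))                 # data
```

A field like `| mother = [[Queen Victoria]] of [[England]]` comes out "mixed": mapping it with a single object property silently drops the text, and mapping it as a datatype property drops the links.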
Re: [Dbpedia-ontology] [Dbpedia-discussion] Advancing the DBpedia ontology
I agree that there are problems with the mappings. However, how can the mappings be fixed without fixing the ontology?

peter

On 02/18/2015 05:03 AM, Vladimir Alexiev wrote:
> Hi everyone! My presentations from the Dublin meeting are at
> - http://VladimirAlexiev.github.io/pres/20150209-dbpedia/add-mapping-long.html
>   An example of adding a mapping, while making a couple of props along the
>   way and reporting a couple of problems.
> - http://VladimirAlexiev.github.io/pres/20150209-dbpedia/dbpedia-problems-long.html
>   Provides a wider perspective: data problems are due not only to the
>   ontology but to many other areas:
>   3. Mapping Language Issues
>   4. Mapping Server Deficiencies
>   5. Mapping Wiki Deficiencies
>   6. Mapping Issues
>   7. Extraction Framework Issues
>   8. External Mapping Problems
>   9. Ontology Problems
>   Almost all of these are also reported in the two trackers described in
>   sec. 2.
>
> Heiko Paulheim he...@informatik.uni-mannheim.de wrote:
>> I am currently working with Aldo Gangemi on exploiting the mappings to
>> DOLCE (and the high-level disjointness axioms in DOLCE) for finding
>> modeling issues both in the instances and the ontology.
>
> Sounds very interesting! I've been quite active in the last couple of
> months, but I've been pecking at random here and there. More systematic
> approaches are definitely needed, as long as they are not limited to a
> theoretical experiment or a one-time effort that's quickly closed down.
> I've observed many error patterns, and if people smarter than me can devise
> ways to leverage and amplify these observations using algorithmic or ML
> approaches, that could create fast progress. I give some examples of the
> need for research on specific problems:
> http://VladimirAlexiev.github.io/pres/20150209-dbpedia/dbpedia-problems-long.html#sec-6-5
> and the next section.
>
> Harald Sack:
>> apply the DBpedia ontology to detect inconsistencies and flaws in DBpedia
>> facts. This should not only be possible in a retroactive way, but should
>> take place much earlier. Besides the detection of inconsistencies during
>> the mapping process or afterwards in the extracted data
>
> Sounds very promising! If I can help somehow with manual ontological wiki
> labor, let me know. Data vs ontology validation can provide
> - mapping defect lists
> - useful hints that the Extraction Framework can use.
> The most important feature would be "Use Domain & Range to Guide
> Extraction":
> http://vladimiralexiev.github.io/pres/20150209-dbpedia/dbpedia-problems-long.html#sec-7-4
>
>> this could already be possible right from the start when the user is
>> changing the wikipedia infobox content (in the sense of type checking for
>> domain/range, checking of class disjointness and further constraints,
>> plausibility checks for dates)
>
> I'm doubtful of the utility of error lists to Wikipedia (or it needs to be
> done with skill and tact):
> 1. The mapping wiki adopts an Object vs DataProp dichotomy (it uses
>    owl:ObjectProperty and owl:DatatypeProperty and never rdf:Property). But
>    MANY Wikipedia fields include both links and text, and in many cases
>    BOTH are useful:
>    http://vladimiralexiev.github.io/pres/20150209-dbpedia/dbpedia-problems-long.html#sec-3-2
> 2. At the end of the day, Wikipedia is highly crafted text, so telling
>    Wikipedia editors that they can't write something will not sit well with
>    them. For example, who should resolve this contradiction?
>      DBO:       dbo:parent rdfs:range dbo:Person
>      Wikipedia: | mother = [[Queen Victoria]] of [[England]]
>    I think the Extraction Framework should (by filtering out the link that
>    is not a Person), not Wikipedians.
>
>> a tool that makes inconsistencies/flaws in wikipedia data visible directly
>> in the wikipedia interface, where users could either correct them or
>> confirm facts that are originally in doubt.
>
> But Wikipedia is moving towards using Wikidata props in template fields,
> through {{#property}}.
>
> Cheers!

___
Dbpedia-ontology mailing list
dbpedia-ontol...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-ontology
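The "use domain/range to guide extraction" idea above can be sketched in a few lines. This is an illustrative toy, not the actual Extraction Framework: `TYPES` stands in for the DBpedia instance-types data, and `extract_with_range` is a hypothetical helper that keeps only the links compatible with a property's declared rdfs:range, so the contradiction is resolved by the extractor rather than by Wikipedia editors.

```python
# Toy type index standing in for the DBpedia instance-types dataset.
TYPES = {
    "Queen Victoria": "Person",
    "England": "Country",
}

def extract_with_range(field_links, expected_range):
    """Keep only the links whose known type matches the property's range.

    For dbo:parent (rdfs:range dbo:Person), the field
    '| mother = [[Queen Victoria]] of [[England]]' should yield only
    Queen Victoria; the England link is filtered out, with no Wikipedia
    edit required.
    """
    return [link for link in field_links if TYPES.get(link) == expected_range]

print(extract_with_range(["Queen Victoria", "England"], "Person"))
```

The same check, run at mapping time instead of extraction time, would also produce the "mapping defect lists" mentioned above.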
Re: [Dbpedia-discussion] Advancing the DBpedia ontology
Is there going to be the possibility of at least listening in remotely?

peter

On 01/24/2015 12:02 AM, Dimitris Kontokostas wrote:
> Hi Peter,
> ATM I can only answer for disjointness axioms. We plan to use them for
> cleaning up extracted data, so we definitely want them. For the rest, we
> are open to suggestions, and this is one of the reasons we invite ontology
> experts to participate.
> Best, Dimitris
>
> On Sat, Jan 24, 2015 at 5:47 AM, Peter F. Patel-Schneider
> pfpschnei...@gmail.com wrote:
>> Good points all, I think. As well, I would like to know what expressive
>> power is being considered for the ontology language. For example, will
>> disjointness axioms be allowed, or local ranges, or constraints?
>> peter
>>
>> On 01/23/2015 12:23 PM, Nicolas Torzec wrote:
>>> Hi Dimitris et al.,
>>> A) What is the specific use you have in mind?
>>> B) Are you thinking about a centralized ontology managed by editors, a
>>>    user-contributed ontology, or an automatically generated taxonomy?
>>> C) How will it relate to other ontologies, taxonomies, and schemas? Also,
>>>    will it relate to Wikidata, Wikipedia, schema.org, Facebook OG, etc.?
>>> D) How will you categorize Wiki pages (and possibly other documents)
>>>    against this ontology?
>>> Cheers.
>>> -N.
>>> --
>>> Nicolas Torzec
>>> Yahoo Labs.
>
> --
> Dimitris Kontokostas
> Department of Computer Science, University of Leipzig
> Research Group: http://aksw.org
> Homepage: http://aksw.org/DimitrisKontokostas

___
Dbpedia-discussion mailing list
dbpedia-discuss...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
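Using disjointness axioms to clean extracted data, as Dimitris describes, can be sketched as a simple consistency check. The `DISJOINT` table below is invented for illustration (the real axioms would come from the ontology, or from the DOLCE mappings discussed elsewhere in this thread), and `disjointness_violations` is a hypothetical helper, not DBpedia code.

```python
# Illustrative disjointness axioms: each pair of classes may not share
# an instance. In practice these would be read from the ontology.
DISJOINT = {
    frozenset({"Person", "Organisation"}),
    frozenset({"Person", "Place"}),
}

def disjointness_violations(instance_types):
    """Return (instance, sorted class pair) for every disjointness conflict.

    instance_types maps an instance name to the list of classes it was
    extracted with; a violation flags a candidate for cleanup, e.g. a
    Singer wrongly typed as a Band (Organisation).
    """
    violations = []
    for instance, types in instance_types.items():
        for pair in DISJOINT:
            if pair <= set(types):
                violations.append((instance, tuple(sorted(pair))))
    return violations

data = {"Some Singer": ["Person", "Organisation"]}  # wrong direct type
print(disjointness_violations(data))
```

Each reported pair points at an extraction or mapping defect: either one of the type assertions is wrong, or the disjointness axiom is too strong, and both outcomes are useful feedback.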
Re: [Dbpedia-discussion] Advancing the DBpedia ontology
Good points all, I think. As well, I would like to know what expressive power is being considered for the ontology language. For example, will disjointness axioms be allowed, or local ranges, or constraints?

peter

On 01/23/2015 12:23 PM, Nicolas Torzec wrote:
> Hi Dimitris et al.,
> A) What is the specific use you have in mind?
> B) Are you thinking about a centralized ontology managed by editors, a
>    user-contributed ontology, or an automatically generated taxonomy?
> C) How will it relate to other ontologies, taxonomies, and schemas? Also,
>    will it relate to Wikidata, Wikipedia, schema.org, Facebook OG, etc.?
> D) How will you categorize Wiki pages (and possibly other documents)
>    against this ontology?
> Cheers.
> -N.
> --
> Nicolas Torzec
> Yahoo Labs.
Re: scientific publishing process (was Re: Cost and access)
Done. The goal of a new paper-preparation and display system should, however, be to be better than what is currently available. Most HTML-based solutions do not exploit the benefits of HTML, strangely enough. Consider, for example, citation links. They generally jump you to the references section. They should instead pop up the reference, as is done in Wikipedia. Similarly for links to figures. Instead of blindly jumping to the figure, they should do something better, perhaps popping up the figure or, if the figure is already visible, just highlighting it. I have put in both of these as issues.

peter

On 10/08/2014 03:18 AM, Sarven Capadisli wrote:
> On 2014-10-07 15:44, Peter F. Patel-Schneider wrote:
>> Well, I remain totally unconvinced that any current HTML solution is as
>> good as the current PDF setup. Certainly htlatex is not suitable. There may
>> be some way to get tex4ht to do better, but no one has provided a solution.
>> Sarven Capadisli sent me some HTML that looks much better, but even on a
>> math-light paper I could see a number of glitches. I haven't seen anything
>> better than that.
>
> Would you mind creating an issue for the glitches that you are
> experiencing? https://github.com/csarven/linked-research/issues
> Please mention your environment and the documents you've looked at. Also
> keep in mind the LNCS and ACM SIG authoring guidelines. The purpose of the
> LNCS and ACM CSS is to adhere to the authoring guidelines so that the
> generated PDF file or print output looks as expected (within reason).
> Much appreciated!
> -Sarven
> http://csarven.ca/#i
Re: scientific publishing process (was Re: Cost and access)
On 10/08/2014 05:31 AM, Phillip Lord wrote:
> Peter F. Patel-Schneider pfpschnei...@gmail.com writes:
>> PLOS is an interesting case. The HTML for PLOS articles is relatively
>> readable. However, the HTML that the PLOS setup produces is failing at
>> math, even for articles from August 2014. As well, sometimes when I zoom in
>> or out (so that I can see the math better) Firefox stops displaying the
>> paper, and I have to reload the whole page.
> Interesting bug, that. Worth reporting to PLoS.

PLoS doesn't appear to have a bug-reporting system in place. Even their general assistance email is obfuscated. I sent them a message anyway.

>> Strangely, PLOS accepts low-resolution figures, which in one paper I looked
>> at are quite difficult to read.
> Yep. Although it often provides several links to download higher-resolution
> images, including in the original file format. Quite handy.

In this case, even the original was low resolution.

>> However, maybe the PLOS method can be improved to the point where the HTML
>> is competitive with PDF.
> Indeed. For the moment, HTML views are about 1/5 of PDF. Partly this is
> because scientists are used to viewing in print format, I suspect, but
> partly not. I'm hoping that, eventually, PLoS will stop using image-based
> maths. I'd like to be able to zoom maths independently, and copy and paste
> it as either MathML or TeX. MathJax does this now already.

I would suggest that this should have been one of their highest priorities.

> Phil

peter
Re: scientific publishing process (was Re: Cost and access)
If you mean that published papers have to be in PDF, but that they can optionally have a second format, then I have no problem with this proposal. I also have no problem with encouraging the use of other formats. However, this is an added burden on conference organizers. Someone would have to volunteer to handle the extra work, particularly the work involved in checking that papers using the second format abide by the publishing requirements.

peter

On 10/07/2014 05:52 AM, Robert Stevens wrote:
> What I'd suggest for conference organisers is something like the following:
> 1. Keep the PDF as the main thing, as it's not going anywhere soon.
> 2. Also allow submission in some alternative form, including semantic
>    content, and have the conference run a competition for alternative
>    publishing forms, including voting by delegates on what they like and
>    what they want. This could promote such alternative forms and offer a
>    migration route over time.
> Robert.
>
> On 07/10/2014 13:27, Phillip Lord wrote:
>> Peter F. Patel-Schneider pfpschnei...@gmail.com writes:
>>> So, you believe that there is an excellent set of tools for preparing,
>>> reviewing, and reading scientific publishing. Package them up and make
>>> them widely available. If they are good, people will use them. Convince
>>> those who run conferences. If these people are convinced, then they will
>>> allow their use in conferences or maybe even require their use.
>> Is that not the point of the discussion? Unfortunately, we do not know why
>> ISWC and ESWC insist on PDF.
>>> I'm not convinced by what I'm seeing right now, however.
>> Sure, but at least the discussion has meant that you have looked at some
>> of the tools again. That's no bad thing. My question would be: are you
>> more convinced than you were the last time you looked, or less?
>> Phil
Re: scientific publishing process (was Re: Cost and access)
On 10/07/2014 05:27 AM, Phillip Lord wrote:
> Peter F. Patel-Schneider pfpschnei...@gmail.com writes:
>> So, you believe that there is an excellent set of tools for preparing,
>> reviewing, and reading scientific publishing. Package them up and make them
>> widely available. If they are good, people will use them. Convince those
>> who run conferences. If these people are convinced, then they will allow
>> their use in conferences or maybe even require their use.
> Is that not the point of the discussion?

Not at all. Where was the proposal to put together something that met the requirements of preparing, reviewing, and publishing scientific papers? To me, the initial discussion was about how much better HTML was for carrying data. Other aspects of paper preparation, review, and publishing were not being considered. Now, maybe, aspects of presentation and review and ease of use are part of the discussion. A change in the paper submission process needs to take into account what the paper submission process is about, not just some aspect of what might be included in submitted papers.

> Unfortunately, we do not know why ISWC and ESWC insist on PDF.

As far as I am concerned, ISWC and ESWC insist on PDF for submissions because the reviewing process is so much better with PDF than with anything else.

>> I'm not convinced by what I'm seeing right now, however.
> Sure, but at least the discussion has meant that you have looked at some of
> the tools again. That's no bad thing. My question would be: are you more
> convinced than you were the last time you looked, or less?

Well, I remain totally unconvinced that any current HTML solution is as good as the current PDF setup. Certainly htlatex is not suitable. There may be some way to get tex4ht to do better, but no one has provided a solution. Sarven Capadisli sent me some HTML that looks much better, but even on a math-light paper I could see a number of glitches. I haven't seen anything better than that. It's not as if the basics (MathML, CSS, etc.) are unavailable to put together most, or maybe even all, of an HTML-based solution. These basics have been around for some time now. However, I haven't seen a setup that is as good as LaTeX and PDF for the preparation, review, and publishing of scientific papers.

Yes, it took a lot of effort to get to the current state with respect to LaTeX and PDF. In the past, I experienced quite a number of problems with using LaTeX and PDF for writing, reviewing, and publishing scientific papers, but most of these are in the past. Yes, there are still some problems with using LaTeX and PDF. Produce something better and people will use it, eventually.

> Phil

peter
Re: scientific publishing process (was Re: Cost and access)
On 10/07/2014 05:23 AM, Phillip Lord wrote:
> Peter F. Patel-Schneider pfpschnei...@gmail.com writes:
>> On 10/06/2014 11:00 AM, Phillip Lord wrote:
>>> Peter F. Patel-Schneider pfpschnei...@gmail.com writes:
>>>> On 10/06/2014 09:32 AM, Phillip Lord wrote:
>>>>> Peter F. Patel-Schneider pfpschnei...@gmail.com writes:
>>>>>>> Who cares what the authors intend? I mean, they are not reading the
>>>>>>> paper, are they?
>>>>>> For reviewing, what the authors intend is extremely important. Having
>>>>>> different renderings of the paper interfere with the authors' message
>>>>>> is something that should be avoided at all costs.
>>>>> Really? So, for example, you think that a reviewer with impaired vision
>>>>> should, for example, be forced to review a paper using the authors'
>>>>> rendering, regardless of whether they can read it or not?
>>>> No, but this is not what I was talking about. I was talking about
>>>> interfering with the authors' message via changes from the rendering
>>>> that the authors set up.
>>> It *is* exactly what you are talking about.
>> Well, maybe I was not being clear, but I thought that I was talking about
>> rendering changes interfering with comprehension of the authors' intent.
> And if only you had a definition of "rendering changes that interfere with
> the authors' intent", as opposed to just "rendering changes". I can
> guarantee that rendering a paper to speech WILL change at least some of the
> authors' intent because, for example, figures will not reproduce. You state
> that this should be avoided at all costs. I think this is wrong. There are
> many reasons to change rendering. That should be the reader's choice.
> Phil

I think that for reviewing the authors should be able to dictate how their submission looks, within the bounds of the submission requirements. If the reviewer wants, or needs, to change the way a submission is presented, then it is up to the reviewer to ensure that their review is not coloured by this change. When I review papers I routinely point out presentation problems. Sometimes I take presentation problems into account when I evaluate papers. However, I try very hard to evaluate the submission based on what the authors submitted, not on any changes that I made to the submission. For example, I will point out problems with using colours in graphs, but I will evaluate the paper based on the coloured version of the graphs, not a black-and-white version. However, if the authors submitted low-resolution figures and something is missing because of this, then I feel free to take this into account in my evaluation. In a situation where I do not know what presentation the authors wanted, for example if explicit line breaks and indentation are sometimes preserved but not always, the evaluation of submissions can become very much harder.

peter
Re: scientific publishing process (was Re: Cost and access)
On 10/07/2014 05:20 AM, Phillip Lord wrote:
> Peter F. Patel-Schneider pfpschnei...@gmail.com writes:
>>> tex4ht takes the slightly strange approach of having a cryptic and
>>> incomprehensible command line, and then lots of scripts which supply
>>> default options, of which xhmlatex is one. In my installation, they've
>>> only put the basic ones into the path, so I ran this with
>>> /usr/share/tex4ht/xhmlatex.
>>> Phil
>> So someone has to package this up so that it can be easily used. Before
>> then, how can it be required for conferences?
> http://svn.gnu.org.ua/sources/tex4ht/trunk/bin/ht/unix/xhmlatex

Somehow this is not in my tex4ht package. In any case, the HTML output it produces is dreadful. Text characters, even outside math, are replaced by numeric XML character entity references.

peter
Re: scientific publishing process (was Re: Cost and access)
Sure, I have lots of papers (none for ESWC, though) that could serve as test cases.

peter

On 10/07/2014 07:49 AM, Phillip Lord wrote:
> Peter F. Patel-Schneider pfpschnei...@gmail.com writes:
>>>> tex4ht takes the slightly strange approach of having a cryptic and
>>>> incomprehensible command line, and then lots of scripts which supply
>>>> default options, of which xhmlatex is one. In my installation, they've
>>>> only put the basic ones into the path, so I ran this with
>>>> /usr/share/tex4ht/xhmlatex.
>>>> Phil
>>> So someone has to package this up so that it can be easily used. Before
>>> then, how can it be required for conferences?
>> http://svn.gnu.org.ua/sources/tex4ht/trunk/bin/ht/unix/xhmlatex
> Somehow this is not in my tex4ht package. In any case, the HTML output it
> produces is dreadful. Text characters, even outside math, are replaced by
> numeric XML character entity references.
>
> So, I am willing to spend some time getting this to work. I would like to
> plug some ESWC papers into tex4ht, to get some HTML which works plain and
> also with Sarven's templates so that it *looks* like a PDF. Would you be
> willing to a) try it and b) give worked, short test cases for things that
> do not work?
> Phil
Re: scientific publishing process (was Re: Cost and access)
PLOS is an interesting case. The HTML for PLOS articles is relatively readable. However, the HTML that the PLOS setup produces is failing at math, even for articles from August 2014. As well, sometimes when I zoom in or out (so that I can see the math better) Firefox stops displaying the paper, and I have to reload the whole page. Strangely, PLOS accepts low-resolution figures, which in one paper I looked at are quite difficult to read. However, maybe the PLOS method can be improved to the point where the HTML is competitive with PDF.

peter

> This makes me think of PLoS. For example, PLoS has published format
> guidelines using Word and LaTeX (http://www.plosone.org/static/guidelines),
> a workflow for semantically structuring the resulting output, and final
> output that is well structured and available in XML based on a known
> standard (http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd), in
> PDF, and as published HTML on their website
> (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011233).
> This results in semantically meaningful XML that is transformed to HTML:
> http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0011233representation=XML
> Interestingly as well, they have provided this framework in open source
> form: http://www.ambraproject.org/
>
> Clearly the publication process can support a semantic solution when it is
> in the best interest of the publisher. They will adopt and drive their own
> markup processes to meet external demand. Providing tools that both the
> publisher and the author may use independently could simplify such an
> effort, but that is not a main driver in achieving the final result you see
> in PLoS. This is especially the case given even the debate concerning file
> formats here. For PLoS, the solution that is currently successful is the
> one that worked to solve today's immediate local need with today's tools.
>
> Cheers,
> Mark
>
> p.s. Finally, on the reference to moving repositories such as EPrints and
> DSpace towards supporting semantic markup of their contents: being somewhat
> of a participant in LoD on the DSpace side, I note that these efforts are
> inherently just repository-centric, describing the structure of the
> repository (i.e. collections of items), not the semantic structure
> contained within the item contents (articles, citations, formulas, data
> tables, figures, ideas). In both platforms, these capabilities are in their
> infancy; lacking any rendering other than offering the original file for
> download, they ultimately suffer from the absence of semantic structure in
> the content going into them.
> --
> Mark R. Diggory
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 04:15 AM, Phillip Lord wrote:
> Peter F. Patel-Schneider pfpschnei...@gmail.com writes:
>> One problem with allowing HTML submission is ensuring that reviewers can
>> correctly view the submission as the authors intended it to be viewed. How
>> would you feel if your paper was rejected because one of the reviewers
>> could not view portions of it? At least with PDF there is a reasonably good
>> chance that every paper can be correctly viewed by all its reviewers, even
>> if they have to print it out. I don't think that the same claim can be made
>> for HTML-based systems.
> I don't think this is a valid point. It is certainly possible to write HTML
> that will not look good on every machine, but these days it is easier to
> write HTML that does. The same is true with PDF. Font problems used to be
> routine. And, as other people have said, it's very hard to write a PDF that
> looks good on anything other than paper.

My aesthetics are different. I routinely view PDFs on my laptop, and find that they indeed look great. As I said before, I prefer PDF to HTML for viewing just about any technical material on my computers. Yes, on limited displays two-column PDF may not be viewable at all. Single-column PDF should look good on displays with resolution of HD or better. When I view HTML documents, even the ones I have written, I have to do a lot of adjusting to get something that looks even half-decent on the screen. And when I print HTML documents, the result is invariably bad, and often very bad.

However, my point was not about looking good. It was about being able to see the paper in the way that the author intended. My experience is that this is generally possible with PDF, but generally not possible with HTML. I do write papers with considerable math in them, so my experience may not be typical, but whenever I have tried to produce HTML versions of my papers, I have ended up quite frustrated because even I cannot get them to display the way I want them to. It may be that there are now good tools for producing HTML that carries the intent of the author. htlatex has been mentioned in this thread. A solution that uses htlatex would have the benefit of building on much of the work that has been done to make LaTeX a reasonable technology for producing papers. If someone wants to create the necessary infrastructure to make htlatex work as well as pdflatex does, then feel free.

>> Further, why should there be any technical preference for HTML at all?
>> (Yes, HTML is an open standard and PDF is a closed one, but is there
>> anything else besides that?) Web conferences vitally use the web in their
>> reviewing and publishing processes. Doesn't that show their allegiance to
>> the web? Would the use of HTML make a conference more webby?
> PDF is, I think, open these days. But, yes, I do think that conferences
> should dogfood. I mean, what would you think if the W3C produced all of
> their documents in PDF? Would that make sense?

Actually, I would have been very happy if the W3C had produced all its technical documents in PDF. It would have made my life much easier.

> Phil

peter
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 04:27 AM, Phillip Lord wrote: [On using htlatex for conferences.]
> So, as well as providing an LNCS stylesheet, we'd need an htlatex cf.cfg and
> one CSS, and it's done. It would be good to have another CSS for on-screen
> viewing; LNCS's back-of-a-postage-stamp format is very poor for that.
> Phil

I would be totally astonished if using htlatex as the main way to produce conference papers were as simple as this. I just tried htlatex on my ISWC paper, and the result was, to put it mildly, horrible. (One of my AAAI papers was about the same; the other caused an undefined control sequence and produced only one page of output.) Several parts of the paper were rendered in fixed-width fonts. There was no attempt to limit line length. Footnotes were in separate files. Many non-scalable images were included, even for simple math. My carefully designed layout for examples was modified in ways that made the examples harder to understand. The footnotes did not show up at all in the printed version.

That said, the result was better than I expected. If someone upgrades htlatex to work well, I'm quite willing to use it, but I expect that a lot of work is going to be needed.

peter
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 08:38 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: I would be totally astonished if using htlatex as the main way to produce conference papers were as simple as this. I just tried htlatex on my ISWC paper, and the result was, to put it mildly, horrible. (One of my AAAI papers was about the same, the other one caused an undefined control sequence and only produced one page of output.) Several parts of the paper were rendered in fixed-width fonts. There was no attempt to limit line length. Footnotes were in separate files. The footnote thing is pretty strange, I have to agree. Although footnotes are a fairly alien concept with respect to the web. Probably hover-overs would be a reasonable presentation for this. Many non-scalable images were included, even for simple math. It does MathML I think, which is then rendered client side. Or you could drop math-mode straight through and render client side with mathjax. Well, somehow png files are being produced for some math, which is a failure. I don't know what the way to do this right would be, I just know that the version of htlatex for Fedora 20 fails to reasonably handle the math in this paper. My carefully designed layout for examples was modified in ways that made the examples harder to understand. Perhaps this is a key difference between us. I don't care about the layout, and want someone to do it for me; it's one of the reasons I use latex as well. There are many cases where line breaks and indentation are important for understanding. Getting this sort of presentation right in latex is a pain for starters, but when it has been done, having the htlatex toolchain mess it up is a failure. That said, the result was better than I expected. If someone upgrades htlatex to work well I'm quite willing to use it, but I expect that a lot of work is going to be needed. Which gets us back to the chicken and egg situation.
I would probably do this; but, at the moment, ESWC and ISWC won't let me submit it. So, I'll end up with the PDF output anyway. Well, I'm with ESWC and ISWC here. The review process should be designed to make reviewing easy for reviewers. Until viewing HTML output is as trouble-free as viewing PDF output, then PDF should be the required format. This is why it is important that web conferences allow HTML, which is where the argument started. If you want something that prints just right, PDF is the thing for you. If you want to read your papers in the bath, likewise, PDF is the thing for you. And that's fine by me (so long as you don't mind me reading your papers in the bath!). But it needs to not be the only option. Why? What are the benefits of HTML reviewing, right now? What are the benefits of HTML publishing, right now? If there were HTML-based tools that worked well for preparing, reviewing, and reading scientific papers, then maybe conferences would use them. However, conference organizers and reviewers have limited time, and are thus going for the simplest solution that works well. If some group thinks that a good HTML-based solution is possible, then let them produce this solution. If the group can get pre-approval of some conference, then more power to them. However, I'm not going to vote for any pre-approval of some future solution when the current situation is satisficing. Phil peter
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 08:29 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: However, my point was not about looking good. It was about being able to see the paper in the way that the author intended. Yes, I understand this. It's not something that I consider at all important, which perhaps represents our different viewpoints. Readers have different preferences. I prefer reading in inverse video; I like to be able to change font size to zoom in and out. I quite like fixed width fonts. Other people like the two column thing. Other people want things read to them. Who cares what the authors intend? I mean, they are not reading the paper, are they? For reviewing, what the authors intend is extremely important. Having different rendering of the paper interfere with the authors' message is something that should be avoided at all costs. Similarly for reading papers, if the rendering of the paper interferes with the authors' message, that is a failure of the process. I do write papers with considerable math in them, so my experience may not be typical, but whenever I have tried to produce HTML versions of my papers, I have ended up quite frustrated because even I cannot get them to display the way I want them to. I've been using mathjax on my website for a long time and it seems to work well, although I am not maths heavy. It may be that there are now good tools for producing HTML that carries the intent of the author. htlatex has been mentioned in this thread. A solution that uses htlatex would have the benefit of building on much of the work that has been done to make latex a reasonable technology for producing papers. If someone wants to create the necessary infrastructure to make htlatex work as well as pdflatex does, then feel free. It's more to make htlatex work as well as lncs.sty works. htlatex produces reasonable, if dull, HTML off the bat. My experience is that htlatex produces very bad output. Phil peter
Re: scientific publishing process (was Re: Cost and access)
It's not hard to query PDFs with SPARQL. All you have to do is extract the metadata from the document and turn it into RDF, if needed. Lots of programs extract and display this metadata already. No, I don't think that viewing this issue from the reviewer perspective is too narrow. Reviewers form a vital part of the scientific publishing process. Anything that makes their jobs harder or their results worse will have to offer very large benefits over the current setup. In any case, I haven't been looking at the reviewer perspective only, even in the message quoted below. peter PS: This is *not* to say that I think that the reviewing process is anywhere near ideal. On the contrary, I think that the reviewing process has many problems, particularly as it is performed in CS conferences. On 10/06/2014 09:19 AM, Martynas Jusevičius wrote: Dear Peter, please show me how to query PDFs with SPARQL. Then I'll believe there are no benefits of XHTML+RDFa over PDF. Addressing the issue from the reviewer perspective only is too narrow, don't you think? Martynas On Mon, Oct 6, 2014 at 6:08 PM, Peter F. Patel-Schneider pfpschnei...@gmail.com wrote: On 10/06/2014 08:38 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: I would be totally astonished if using htlatex as the main way to produce conference papers were as simple as this. I just tried htlatex on my ISWC paper, and the result was, to put it mildly, horrible. (One of my AAAI papers was about the same, the other one caused an undefined control sequence and only produced one page of output.) Several parts of the paper were rendered in fixed-width fonts. There was no attempt to limit line length. Footnotes were in separate files. The footnote thing is pretty strange, I have to agree. Although footnotes are a fairly alien concept with respect to the web. Probably hover-overs would be a reasonable presentation for this. Many non-scalable images were included, even for simple math.
It does MathML I think, which is then rendered client side. Or you could drop math-mode straight through and render client side with mathjax. Well, somehow png files are being produced for some math, which is a failure. I don't know what the way to do this right would be, I just know that the version of htlatex for Fedora 20 fails to reasonably handle the math in this paper. My carefully designed layout for examples was modified in ways that made the examples harder to understand. Perhaps this is a key difference between us. I don't care about the layout, and want someone to do it for me; it's one of the reasons I use latex as well. There are many cases where line breaks and indentation are important for understanding. Getting this sort of presentation right in latex is a pain for starters, but when it has been done, having the htlatex toolchain mess it up is a failure. That said, the result was better than I expected. If someone upgrades htlatex to work well I'm quite willing to use it, but I expect that a lot of work is going to be needed. Which gets us back to the chicken and egg situation. I would probably do this; but, at the moment, ESWC and ISWC won't let me submit it. So, I'll end up with the PDF output anyway. Well, I'm with ESWC and ISWC here. The review process should be designed to make reviewing easy for reviewers. Until viewing HTML output is as trouble-free as viewing PDF output, then PDF should be the required format. This is why it is important that web conferences allow HTML, which is where the argument started. If you want something that prints just right, PDF is the thing for you. If you want to read your papers in the bath, likewise, PDF is the thing for you. And that's fine by me (so long as you don't mind me reading your papers in the bath!). But it needs to not be the only option. Why? What are the benefits of HTML reviewing, right now? What are the benefits of HTML publishing, right now?
If there were HTML-based tools that worked well for preparing, reviewing, and reading scientific papers, then maybe conferences would use them. However, conference organizers and reviewers have limited time, and are thus going for the simplest solution that works well. If some group thinks that a good HTML-based solution is possible, then let them produce this solution. If the group can get pre-approval of some conference, then more power to them. However, I'm not going to vote for any pre-approval of some future solution when the current situation is satisficing. Phil peter
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 09:28 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: It does MathML I think, which is then rendered client side. Or you could drop math-mode straight through and render client side with mathjax. Well, somehow png files are being produced for some math, which is a failure. Yeah, you have to tell it to do mathml. The problem is that older versions of the browsers don't render mathml, and image rendering was the only option. Well, then someone is going to have to tell people how to do this. What I saw for htlatex was that it just did the right thing. I don't know what the way to do this right would be, I just know that the [...] There are many cases where line breaks and indentation are important for understanding. Getting this sort of presentation right in latex is a pain for starters, but when it has been done, having the htlatex toolchain mess it up is a failure. Indeed. I believe that there are plans in future versions of HTML to introduce a pre tag which preserves indentation and line breaks. Which gets us back to the chicken and egg situation. I would probably do this; but, at the moment, ESWC and ISWC won't let me submit it. So, I'll end up with the PDF output anyway. Well, I'm with ESWC and ISWC here. The review process should be designed to make reviewing easy for reviewers. I *only* use PDF when reviewing. I never use it for viewing anything else. I only use it for reviewing since I am forced to. Experiences differ, so I find this a far from compelling argument. It may not be a compelling argument when choosing between two new alternatives, but it is a much more compelling argument against change. This is why it is important that web conferences allow HTML, which is where the argument started. Why? What are the benefits of HTML reviewing, right now? What are the benefits of HTML publishing, right now? Well, we've been through this before, so I'll not repeat myself.
Phil Yes, and I haven't seen any benefits over the current setup. peter
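[Editor's note: the client-side route Phillip mentions (leaving the TeX markup in the page and letting MathJax typeset it in the browser) can be sketched with a minimal page like the one below. This is an illustrative sketch only; the CDN URL and configuration name reflect how MathJax was commonly distributed around the time of this thread and may since have changed.]

```html
<!DOCTYPE html>
<html>
<head>
  <!-- Illustrative sketch: load MathJax and let it typeset the TeX
       delimiters \( ... \) client side, instead of baking the math
       into PNG images at conversion time. -->
  <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
</head>
<body>
  <p>Mass-energy equivalence: \(e = mc^2\)</p>
</body>
</html>
```

The point of this approach is that the source math survives in the page, so the reader's browser (or a mining tool) still has access to it, rather than only to a rasterized image.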
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 09:32 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: Who cares what the authors intend? I mean, they are not reading the paper, are they? For reviewing, what the authors intend is extremely important. Having different rendering of the paper interfere with the authors' message is something that should be avoided at all costs. Really? So, for example, you think that a reviewer with impaired vision should, for example, be forced to review a paper using the authors' rendering, regardless of whether they can read it or not? No, but this is not what I was talking about. I was talking about interfering with the authors' message via changes from the rendering that the authors set up. Of course, this is an extreme example, although not an unrealistic one. Is it fundamentally any different from my desire as I get older to be able to change font size and refill paragraphs with ease? I see a difference of scale, that is all. I see these as completely different. There are some aspects of rendering that generally do not interfere with intent. There are other aspects of rendering that can easily interfere with intent. Similarly for reading papers, if the rendering of the paper interferes with the authors' message, that is a failure of the process. Yes, I agree. Which is why, I believe, the rendering of a paper should be up to the reader. And this is why I believe that authors should be able to specify the rendering of their paper to the extent that they feel is needed to convey the intent of the paper. Phil peter
Re: scientific publishing process (was Re: Cost and access)
I don't think that scanning a printout retains any metadata that was in the electronic source so, no, this would not follow using the same logic. I do agree that dissemination of results is one of the most important parts of the scientific process. The argument here is, I think, what is the best way to support dissemination. Eating your own dog food is a separate matter, I think. Eating your own dog food may help with uptake, but on the other hand it may interfere with dissemination, by making preparation of papers harder or making them harder to review or read. peter On 10/06/2014 10:09 AM, Martynas Jusevičius wrote: Following the same logic, we still could have been using paper submissions? All you have to do is to scan them to turn them into PDFs. It's been a while since I was in the university, but wasn't dissemination an important part of science? What about dogfooding after all? Martynas On Mon, Oct 6, 2014 at 6:48 PM, Peter F. Patel-Schneider pfpschnei...@gmail.com wrote: It's not hard to query PDFs with SPARQL. All you have to do is extract the metadata from the document and turn it into RDF, if needed. Lots of programs extract and display this metadata already. No, I don't think that viewing this issue from the reviewer perspective is too narrow. Reviewers form a vital part of the scientific publishing process. Anything that makes their jobs harder or their results worse will have to offer very large benefits over the current setup. In any case, I haven't been looking at the reviewer perspective only, even in the message quoted below. peter PS: This is *not* to say that I think that the reviewing process is anywhere near ideal. On the contrary, I think that the reviewing process has many problems, particularly as it is performed in CS conferences. On 10/06/2014 09:19 AM, Martynas Jusevičius wrote: Dear Peter, please show me how to query PDFs with SPARQL. Then I'll believe there are no benefits of XHTML+RDFa over PDF.
Addressing the issue from the reviewer perspective only is too narrow, don't you think? Martynas [...]
Re: scientific publishing process (was Re: Cost and access)
Sure. So extract the text from the PDF and query that. It also would be nice to have access to the LaTeX sources. What HTML publishing *might* have that is better than the above is to more easily embed some extra information into papers that can be queried. Is this just metadata that could also be easily injected into PDFs? If given this capability, will a significant number of authors use it? Is it instead better to have a separate document that has the information and not use HTML for publishing? peter On 10/06/2014 10:42 AM, Alexander Garcia Castro wrote: It's not hard to query PDFs with SPARQL. All you have to do is extract the metadata from the document and turn it into RDF, if needed. Lots of programs extract and display this metadata already. In the age of the web of data, why should I restrict my search just to metadata? I want the full content, open access or not; once I have the document I should be able to mine its content. I don't want to limit my search just to simple metadata. On Mon, Oct 6, 2014 at 9:48 AM, Peter F. Patel-Schneider pfpschnei...@gmail.com mailto:pfpschnei...@gmail.com wrote: It's not hard to query PDFs with SPARQL. All you have to do is extract the metadata from the document and turn it into RDF, if needed. Lots of programs extract and display this metadata already. No, I don't think that viewing this issue from the reviewer perspective is too narrow. Reviewers form a vital part of the scientific publishing process. Anything that makes their jobs harder or their results worse will have to offer very large benefits over the current setup. In any case, I haven't been looking at the reviewer perspective only, even in the message quoted below. peter PS: This is *not* to say that I think that the reviewing process is anywhere near ideal. On the contrary, I think that the reviewing process has many problems, particularly as it is performed in CS conferences.
On 10/06/2014 09:19 AM, Martynas Jusevičius wrote: Dear Peter, please show me how to query PDFs with SPARQL. Then I'll believe there are no benefits of XHTML+RDFa over PDF. Addressing the issue from the reviewer perspective only is too narrow, don't you think? Martynas On Mon, Oct 6, 2014 at 6:08 PM, Peter F. Patel-Schneider pfpschnei...@gmail.com mailto:pfpschnei...@gmail.com wrote: On 10/06/2014 08:38 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com mailto:pfpschnei...@gmail.com writes: I would be totally astonished if using htlatex as the main way to produce conference papers were as simple as this. I just tried htlatex on my ISWC paper, and the result was, to put it mildly, horrible. (One of my AAAI papers was about the same, the other one caused an undefined control sequence and only produced one page of output.) Several parts of the paper were rendered in fixed-width fonts. There was no attempt to limit line length. Footnotes were in separate files. The footnote thing is pretty strange, I have to agree. Although footnotes are a fairly alien concept wrt to the web. Probably hover overs would be a reasonable presentation for this. Many non-scalable images were included, even for simple math. It does MathML I think, which is then rendered client side. Or you could drop math-mode straight through and render client side with mathjax. Well, somehow png files are being produced for some math, which is a failure. I don't know what the way to do this right would be, I just know that the version of htlatex for Fedora 20 fails to reasonably handle the math in this paper. My carefully designed layout for examples was modified in ways that made the examples harder to understand. Perhaps this is a key difference between us. I don't care about the layout, and want someone to do it for me; it's one of the reasons I use latex as well. There are many cases where line breaks and indentation are important for understanding. 
Getting this sort of presentation right in latex is a pain for starters, but when it has been done [...]
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 10:44 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: On 10/06/2014 09:28 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: It does MathML I think, which is then rendered client side. Or you could drop math-mode straight through and render client side with mathjax. Well, somehow png files are being produced for some math, which is a failure. Yeah, you have to tell it to do mathml. The problem is that older versions of the browsers don't render mathml, and image rendering was the only option. Well, then someone is going to have to tell people how to do this. What I saw for htlatex was that it just did the right thing. So, htlatex is part of TeX4ht which does HTML. If you do xhmlatex then you get XHTML with, indeed, math mode in MathML. So, for example, this output comes with the default xhmlatex. <math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>e</mi><mo class="MathClass-rel">=</mo><mi>m</mi><msup><mrow><mi>c</mi></mrow><mrow><mn>2</mn></mrow></msup></math> tex4ht takes the slightly strange approach of having a strange and incomprehensible command line, and then lots of scripts which do default options, of which xhmlatex is one. In my installation, they've only put the basic ones into the path, so I ran this with /usr/share/tex4ht/xhmlatex. Phil So someone has to package this up so that it can be easily used. Before then, how can it be required for conferences? I have tex4ht installed, but there is no xhmlatex file to be found. I managed to find what appears to be a good command line: htlatex schema-org-analysis.tex xhtml,mathml -cunihtf -cvalidate This looks better when viewed, but the resultant HTML is unintelligible. There is definitely more work needed here before this can be considered as a potential solution. peter
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 11:00 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: On 10/06/2014 09:32 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: Who cares what the authors intend? I mean, they are not reading the paper, are they? For reviewing, what the authors intend is extremely important. Having different rendering of the paper interfere with the authors' message is something that should be avoided at all costs. Really? So, for example, you think that a reviewer with impaired vision should, for example, be forced to review a paper using the authors' rendering, regardless of whether they can read it or not? No, but this is not what I was talking about. I was talking about interfering with the authors' message via changes from the rendering that the authors set up. It *is* exactly what you are talking about. Well, maybe I was not being clear, but I thought that I was talking about rendering changes interfering with comprehension of the authors' intent. peter [...]
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 11:00 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: On 10/06/2014 09:32 AM, Phillip Lord wrote: Peter F. Patel-Schneider pfpschnei...@gmail.com writes: Who cares what the authors intend? I mean, they are not reading the paper, are they? For reviewing, what the authors intend is extremely important. Having different rendering of the paper interfere with the authors' message is something that should be avoided at all costs. Really? So, for example, you think that a reviewer with impaired vision should, for example, be forced to review a paper using the authors' rendering, regardless of whether they can read it or not? No, but this is not what I was talking about. I was talking about interfering with the authors' message via changes from the rendering that the authors set up. It *is* exactly what you are talking about. If I want to render your document to speech, then why should I not? What I am saying is that you, the author, should not wish to constrain the rendering, only really the content. Effectively, if you are using latex, you are already doing this, since latex defines the layout and not you. But, I think we are talking in too abstract terms here. Should you be able to constrain indentation for code blocks? Yes, of course, you should. But, a quick look at the web shows that people do this all the time. Sure, and htlatex appears to interfere with this indentation. At least it does in my ISWC paper. Similarly for reading papers, if the rendering of the paper interferes with the authors' message, that is a failure of the process. Yes, I agree. Which is why, I believe, the rendering of a paper should be up to the reader. And this is why I believe that authors should be able to specify the rendering of their paper to the extent that they feel is needed to convey the intent of the paper. For scientific papers, I think this really is not very far.
I mean, a scientific paper is not a fashion store; it's a story designed to persuade with data. I would like to see papers which are in the hands of the reader as much as possible. Citation format should be for the reader. Math presentation. Graphs should be interactive and zoomable, with the data underneath as CSV. All of these are possible and routine with HTML now. I want to be free to choose the organisation of my papers so that I can convey what I want. At the moment, I cannot. The PDF is not reasonable for all, maybe not even most, of this. But some. Phil So, you believe that there is an excellent set of tools for preparing, reviewing, and reading scientific publications. Package them up and make them widely available. If they are good, people will use them. Convince those who run conferences. If these people are convinced, then they will allow their use in conferences or maybe even require their use. I'm not convinced by what I'm seeing right now, however. peter
Re: scientific publishing process (was Re: Cost and access)
On 10/06/2014 11:03 AM, Kingsley Idehen wrote: On 10/6/14 12:48 PM, Peter F. Patel-Schneider wrote: It's not hard to query PDFs with SPARQL. All you have to do is extract the metadata from the document and turn it into RDF, if needed. Lots of programs extract and display this metadata already. Peter, Having had 200+ (some-non-rdf-doc) to RDF document transformers built under my direct guidance, there are issues with your claim above: Huh? Every single PDF reader that I use can extract the PDF metadata and display it. The metadata that I see in PDF documents uses a core set of properties that are easy to transform into RDF. Of course, this core set is very small (title, author, and a few other things) so you don't get all that much out of the core set. 1. The extractors are platform specific -- AWWW is about platform agnosticism (I don't want to mandate an OS for experiencing the power of Linked Open Data transformers / rdfizers) Well, the extractors would be specific to PDF, but that's hardly surprising, I think. 2. It isn't solely about metadata -- we also have raw data inside these documents confined to tables and paragraphs of sentences Well, sure, but is extracting information directly from the figures or tables or text being considered here? I sure would like this to be possible. How would it work in an HTML context? 3. If querying a PDF were marginally simple, I would be demonstrating that using a SPARQL results URL in response to this post :-) I'm not saying that it is so simple. You do have to find the metadata block in the PDF and then look for the /Title, /Author, ... stuff. Possible != Simple and Productive. Yes, but there are lots of tools that display PDF metadata, so there are some who believe that the benefit is greater than the cost. We want to leverage the productivity and simplicity that AWWW brings to data representation, access, interaction, and integration.
Sure, but the additional costs, if any, on paper authors, reviewers, and readers have to be considered. If these costs are eliminated or at least minimized then this good is much more likely to be realized. peter
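[Editor's note: Peter's point that the small core of PDF Info metadata maps easily onto RDF can be illustrated with a short sketch. The property mapping and the input values below are hypothetical, and a real pipeline would first pull the Info dictionary out of the PDF with a PDF library; the sketch only shows the shape of the transformation, emitting Dublin Core N-Triples.]

```python
# Sketch: map a PDF Info dictionary (here just a Python dict) to
# Dublin Core N-Triples. The key-to-predicate mapping and the sample
# metadata are illustrative, not a standard.

# Core PDF Info keys mapped to Dublin Core term predicates.
PDF_TO_DC = {
    "/Title": "http://purl.org/dc/terms/title",
    "/Author": "http://purl.org/dc/terms/creator",
    "/Subject": "http://purl.org/dc/terms/subject",
}

def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes per N-Triples literal syntax."""
    return value.replace("\\", "\\\\").replace('"', '\\"')

def info_to_ntriples(doc_uri: str, info: dict) -> list:
    """Turn the core PDF metadata keys into N-Triples statements."""
    triples = []
    for key, predicate in PDF_TO_DC.items():
        if key in info:
            literal = escape_literal(info[key])
            triples.append(f'<{doc_uri}> <{predicate}> "{literal}" .')
    return triples

# Hypothetical metadata, as a PDF reader would report it.
info = {"/Title": "An ISWC Paper", "/Author": "P. F. Patel-Schneider"}
for line in info_to_ntriples("http://example.org/paper", info):
    print(line)
```

Once the triples are loaded into any store, the "query PDFs with SPARQL" step Peter describes is just an ordinary SPARQL query over them.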
Re: scientific publishing process (was Re: Cost and access)
Neat. This could be extended to putting a full table of contents into the metadata, and in lots of other ways. The other nice thing about it is that it would be possible to push the same data through a LaTeX to HTML toolchain for those who want HTML output. peter On 10/06/2014 03:18 PM, Norman Gray wrote: Greetings. On 2014 Oct 6, at 19:19, Alexander Garcia Castro alexgarc...@gmail.com wrote: querying PDFs is NOT simple and requires a lot of work -and usually produces lots of errors. just querying metadata is not enough. As I said before, I understand the PDF as something that gives me a uniform layout. that is ok and necessary, but not enough or sufficient within the context of the web of data and scientific publications. I would like to have the content readily available for mining purposes. if I pay for the publication I should get access to the publication in every format it is available. the content should be presented in a way so that it makes sense within the web of data. if it is the full content of the paper represented in RDF or XML fine. also, I would like to have well annotated content, this is simple and something that could quite easily be part of existing publication workflows. it may also be part of the guidelines for authors -for instance, identify and annotate rhetorical structures. The following might add something to this conversation. It illustrates getting the metadata from a LaTeX file, putting it into an XMP packet in a PDF, and getting it out of the PDF as RDF. Pace Peter's mention of /Author, /Title, etc, this just focuses on the XMP packet. This has the document metadata, the abstract, and an illustrative bit of argumentation. Adding details about the document structure, and (RDF) pointers to any figures would be feasible, as would, I suspect, incorporating CSV files directly into the PDF. Incorporating \begin{tabular} tables would be rather tricky, but not impossible. 
I can't help feeling that the XHTML+RDFa equivalent would be longer and need more documentation to instruct the author where to put the RDFa magic. It's not very fancy, and still has rough edges, but it only took me 100 minutes, from a standing start. Generating and querying this PDF seems pretty simple to me. [...]
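[Editor's note: for readers who have not seen one, the XMP packet Norman mentions is an XML blob embedded in the PDF file. A minimal illustrative packet carrying Dublin Core metadata might look roughly like this; the title, creator, and abstract values are made up, and only the packet framing follows the XMP convention.]

```xml
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <rdf:Description rdf:about="">
      <dc:title>
        <rdf:Alt><rdf:li xml:lang="x-default">An Example Paper</rdf:li></rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq><rdf:li>A. N. Author</rdf:li></rdf:Seq>
      </dc:creator>
      <dc:description>
        <rdf:Alt><rdf:li xml:lang="x-default">Abstract text here.</rdf:li></rdf:Alt>
      </dc:description>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
```

Because the payload is RDF/XML, extracting it from the PDF yields RDF directly, which is what makes Norman's round trip (LaTeX metadata in, RDF out) straightforward.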
scientific publishing process (was Re: Cost and access)
In my opinion PDF is currently the clear winner over HTML in both the ability to produce readable documents and the ability to display readable documents in the way that the author wants them to display. In the past I have tried various means to produce good-looking HTML and I've always gone back to a setup that produces PDF. If a document is available in both HTML and PDF I almost always choose to view it in PDF. This is the case even though I have particular preferences in how I view documents. If someone wants to change the format of conference submissions, then they are going to have to cater to the preferences of authors, like me, and reviewers, like me. If someone wants to change the format of conference papers, then they are going to have to cater to the preferences of authors, like me, attendees, like me, and readers, like me. I'm all for *better* methods for preparing, submitting, reviewing, and publishing conference (and journal) papers. So go ahead, create one. But just saying that HTML is better than PDF in some dimension, even if it were true, doesn't mean that HTML is better than PDF for this purpose. So I would say that the semantic web community is saying that there are better formats and tools for creating, reviewing, and publishing scientific papers than HTML and tools that create and view HTML. If there weren't these better ways then an HTML-based solution might be tenable, but why use a worse solution when a better one is available? peter On 10/03/2014 08:02 AM, Phillip Lord wrote: [...] As it stands, the only statement that the semantic web community are making is that web formats are too poor for scientific usage. [...] Phil
Re: scientific publishing process (was Re: Cost and access)
One problem with allowing HTML submission is ensuring that reviewers can correctly view the submission as the authors intended it to be viewed. How would you feel if your paper was rejected because one of the reviewers could not view portions of it? At least with PDF there is a reasonably good chance that every paper can be correctly viewed by all its reviewers, even if they have to print it out. I don't think that the same claim can be made for HTML-based systems. Further, why should there be any technical preference for HTML at all? (Yes, HTML is an open standard and PDF is a closed one, but is there anything else besides that?) Web conferences vitally use the web in their reviewing and publishing processes. Doesn't that show their allegiance to the web? Would the use of HTML make a conference more webby? peter On 10/03/2014 09:11 AM, Phillip Lord wrote: In my opinion, the opposite is true. PDF I almost always end up printing out. This isn't the point though. Necessity is the mother of invention. In the ideal world, a web conference would allow only HTML submission. Failing that, it would at least allow HTML submission. But, currently, we cannot submit HTML at all. What is the point of creating a better method, if we can't use it? The only argument that seems at all plausible to me is, well, we've always done it like this, and it's too much effort to change. I could appreciate that. Anyway, the argument is going round in circles. Peter F. Patel-Schneider pfpschnei...@gmail.com writes: In my opinion PDF is currently the clear winner over HTML in both the ability to produce readable documents and the ability to display readable documents in the way that the author wants them to display. In the past I have tried various means to produce good-looking HTML and I've always gone back to a setup that produces PDF. If a document is available in both HTML and PDF I almost always choose to view it in PDF. This is the case even though I have particular preferences in how I view documents.
If someone wants to change the format of conference submissions, then they are going to have to cater to the preferences of authors, like me, and reviewers, like me. If someone wants to change the format of conference papers, then they are going to have to cater to the preferences of authors, like me, attendees, like me, and readers, like me. I'm all for *better* methods for preparing, submitting, reviewing, and publishing conference (and journal) papers. So go ahead, create one. But just saying that HTML is better than PDF in some dimension, even if it were true, doesn't mean that HTML is better than PDF for this purpose. So I would say that the semantic web community is saying that there are better formats and tools for creating, reviewing, and publishing scientific papers than HTML and tools that create and view HTML. If there weren't these better ways then an HTML-based solution might be tenable, but why use a worse solution when a better one is available? peter On 10/03/2014 08:02 AM, Phillip Lord wrote: [...] As it stands, the only statement that the semantic web community are making is that web formats are too poor for scientific usage. [...] Phil
Re: scientific publishing process (was Re: Cost and access)
On 10/03/2014 10:25 AM, Diogo FC Patrao wrote: On Fri, Oct 3, 2014 at 1:38 PM, Peter F. Patel-Schneider pfpschnei...@gmail.com wrote: One problem with allowing HTML submission is ensuring that reviewers can correctly view the submission as the authors intended it to be viewed. How would you feel if your paper was rejected because one of the reviewers could not view portions of it? At least with PDF there is a reasonably good chance that every paper can be correctly viewed by all its reviewers, even if they have to print it out. I don't think that the same claim can be made for HTML-based systems. The majority of journals I'm familiar with mandate a certain format for submission: font size, figure format, etc. So, in an HTML submission, there should be rules as well, a standard CSS and the right elements and classes. Not different from getting a word(c) or latex template. This might help. However, someone has to do this, and ensure that the result is generally viewable. Web conferences vitally use the web in their reviewing and publishing processes. Doesn't that show their allegiance to the web? Would the use of HTML make a conference more webby? As someone said, this is leading by example. Yes, but what makes HTML better for being webby than PDF? dfcp peter
Re: scientific publishing process (was Re: Cost and access)
Does ease of processing make something more webby? If so, LaTeX should be preferred to HTML. peter On 10/03/2014 02:01 PM, john.nj.dav...@bt.com wrote: Yes, but what makes HTML better for being webby than PDF? Because it is a mark-up language (albeit largely syntactic) which makes it much more amenable to machine processing? -Original Message- From: Peter F. Patel-Schneider [mailto:pfpschnei...@gmail.com] Sent: 03 October 2014 21:15 To: Diogo FC Patrao Cc: Phillip Lord; semantic-...@w3.org; public-lod@w3.org Subject: Re: scientific publishing process (was Re: Cost and access) On 10/03/2014 10:25 AM, Diogo FC Patrao wrote: On Fri, Oct 3, 2014 at 1:38 PM, Peter F. Patel-Schneider pfpschnei...@gmail.com wrote: One problem with allowing HTML submission is ensuring that reviewers can correctly view the submission as the authors intended it to be viewed. How would you feel if your paper was rejected because one of the reviewers could not view portions of it? At least with PDF there is a reasonably good chance that every paper can be correctly viewed by all its reviewers, even if they have to print it out. I don't think that the same claim can be made for HTML-based systems. The majority of journals I'm familiar with mandate a certain format for submission: font size, figure format, etc. So, in an HTML submission, there should be rules as well, a standard CSS and the right elements and classes. Not different from getting a word(c) or latex template. This might help. However, someone has to do this, and ensure that the result is generally viewable. Web conferences vitally use the web in their reviewing and publishing processes. Doesn't that show their allegiance to the web? Would the use of HTML make a conference more webby? As someone said, this is leading by example. Yes, but what makes HTML better for being webby than PDF? dfcp peter
Re: scientific publishing process (was Re: Cost and access)
Hmm. Are these semantic? All these seem to do is to signal parts of a document. What I would consider to be semantic would be a way of extracting the mathematical content of a document. peter On 10/03/2014 02:32 PM, Diogo FC Patrao wrote: html5 has so-called semantic tags, like header, section. -- diogo patrão On Fri, Oct 3, 2014 at 6:01 PM, john.nj.dav...@bt.com wrote: Yes, but what makes HTML better for being webby than PDF? Because it is a mark-up language (albeit largely syntactic) which makes it much more amenable to machine processing? -Original Message- From: Peter F. Patel-Schneider [mailto:pfpschnei...@gmail.com] Sent: 03 October 2014 21:15 To: Diogo FC Patrao Cc: Phillip Lord; semantic-...@w3.org; public-lod@w3.org Subject: Re: scientific publishing process (was Re: Cost and access) On 10/03/2014 10:25 AM, Diogo FC Patrao wrote: On Fri, Oct 3, 2014 at 1:38 PM, Peter F. Patel-Schneider pfpschnei...@gmail.com wrote: One problem with allowing HTML submission is ensuring that reviewers can correctly view the submission as the authors intended it to be viewed. How would you feel if your paper was rejected because one of the reviewers could not view portions of it? At least with PDF there is a reasonably good chance that every paper can be correctly viewed by all its reviewers, even if they have to print it out. I don't think that the same claim can be made for HTML-based systems. The majority of journals I'm familiar with mandate a certain format for submission: font size, figure format, etc. So, in an HTML submission, there should be rules as well, a standard CSS and the right elements and classes. Not different from getting a word(c) or latex template. This might help. However, someone has to do this, and ensure that the result is generally viewable.
Web conferences vitally use the web in their reviewing and publishing processes. Doesn't that show their allegiance to the web? Would the use of HTML make a conference more webby? As someone said, this is leading by example. Yes, but what makes HTML better for being webby than PDF? dfcp peter
Re: How to avoid that collections break relationships
On 04/09/2014 12:57 AM, Ruben Verborgh wrote: What then is RDF for you? The Resource Description Framework. It is a framework to describe resources, and this includes predicates. Anybody can define predicates the way they want, otherwise RDF is useless to express semantics. Ok, I describe ex:BaseballPlayer as ex:BaseballPlayer owl:equivalentClass _:x . _:x owl:intersectionOf ( ex:Person [ owl:onProperty ex:plays; owl:hasValue ex:Baseball ] ) Is this RDF? Should all consumers of RDF understand all of this? For example, do you consider N3 to be RDF? No, quantification is not part of RDF. Why not? I could certainly define an encoding of quantification in RDF and use it to define predicates. Can predicates have non-local effects? A predicate indicates a relationship between a subject and an object. What this relationship means is described in the ontology to which the predicate belongs. Predicates may not influence non-related triples, however, other triples might be influenced through a cascade of relations. Why not? I can define predicates however I want, after all? What does using owl:differentFrom in RDF commit you to? It says that two things are different. Clients that can interpret this predicate can apply its meaning. This application does not change the model. What model? Do you mean that all you care about is the abstract syntax? What about rdf:type? What about rdfs:domain? Do all consumers of RDF need to commit to the standard meaning of these predicates? To me, what RDF does not do is just as important as what it does do. This means that RDF captures only the RDF bit of the meaning of predicates - the rest of their meaning remains inaccessible from RDF. Any attempt to go beyond this is … going beyond RDF and it is very important to realize this. RDF is just the model. Giving a predicate meaning is not extending the model. How so? What else is giving a predicate meaning besides extending the model?
Best, Ruben I am really struggling to understand your view of RDF. peter
Re: How to avoid that collections break relationships
On 04/12/2014 05:20 PM, Ruben Verborgh wrote: Hi Peter, Ok, I describe ex:BaseballPlayer as ex:BaseballPlayer owl:equivalentClass _:x . _:x owl:intersectionOf ( ex:Person [ owl:onProperty ex:plays; owl:hasValue ex:Baseball ] ) Is this RDF? Yes. I would say that this is OWL in RDF clothing. Should all consumers of RDF understand all of this? Yes, depending on your interpretation of understand. All of them should parse the triples. This is where RDF ends. This I totally disagree with. RDF is much more than just triples. RDF includes a meaning for triples. Those that can interpret OWL will be able to infer additional things. This is OWL and not part of the RDF model (and thus also not extending the RDF model). <h1>Baseball player</h1> doesn't extend HTML. It just applies HTML to describe a baseball player. As using ex:BaseballPlayer doesn't extend RDF. However, using owl:disjointWith as a predicate in triples, and expecting it to have some relationship to disjointness of RDF class extensions, is an extension of RDF. No, quantification is not part of RDF. Why not? It is not in the spec. But you appear to be using only part of the RDF spec. Why just that part and not the whole RDF spec? If *you* leave parts out, surely it is just as legitimate for *me* to add parts. I could certainly define an encoding of quantification in RDF and use it to define predicates. You indeed can. Predicates may not influence non-related triples, however, other triples might be influenced through a cascade of relations. Why not? I can define predicates however I want, after all? Because, by definition of related, if your predicate is defined to influence a certain (kind of) triple, that triple is related to the usage of the predicate. Sure, but if I can add things to the RDF spec, then I could add something like: The triple a b c means that all subclassOf relationships are strict. What does using owl:differentFrom in RDF commit you to? It says that two things are different.
Clients that can interpret this predicate can apply its meaning. This application does not change the model. What model? The RDF model. Do you mean that all you care about is the abstract syntax? No. But, but, but, isn't that what you said above? All that counts is triples, i.e., the abstract syntax. What about rdf:type? What about rdfs:domain? Do all consumers of RDF need to commit to the standard meaning of these predicates? Yes. But this goes beyond triples. RDF is just the model. Giving a predicate meaning is not extending the model. How so? What else is giving a predicate meaning besides extending the model? It defines something on top of the model. Building a home with bricks does not extend the bricks; it uses them. Yes, sure, which is why using rdfs:domain to infer rdf:type triples is not going beyond RDF(S). However, using owl:sameAs as equality and inferring other triples from this is going beyond RDF(S). It's just like turning a little brick into a long I-beam - you are no longer working with a little brick. I am really struggling to understand your view of RDF. Likewise. But maybe further discussing this doesn't really help the community. My view on RDF works for what I want to do and in my opinion, it's by no means an unreasonable view. But there might be other views… and that might just be fine. Well that's a bit debatable. Standards, even W3C standards, are there so that there is commonality of understanding. If different people take different views of RDF, then its utility is weakened, particularly if everyone still thinks that they are all using the same thing. My view is that getting these differences of opinion out in the open is very helpful. My view is that RDF is defined by the W3C RDF recommendation and that going beyond the inferences sanctioned there is no longer RDF. Best, Ruben peter
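The distinction Peter draws here, that using rdfs:domain to infer rdf:type triples stays inside RDF(S), corresponds to entailment rule rdfs2 of the RDF Semantics recommendation. A minimal sketch in plain Python (triples as tuples; no RDF library, and the ex: names are illustrative):

```python
# A minimal sketch of RDFS entailment rule rdfs2: from (u, p, y) and
# (p, rdfs:domain, C), an RDFS reasoner may infer (u, rdf:type, C).
def apply_rdfs_domain(graph):
    out = set(graph)
    # Collect all declared domains: pairs (property, class).
    domains = [(s, o) for (s, p, o) in graph if p == "rdfs:domain"]
    for (u, p, y) in graph:
        for (prop, cls) in domains:
            if p == prop:
                out.add((u, "rdf:type", cls))
    return out

g = {
    ("ex:Anna", "ex:plays", "ex:Baseball"),
    ("ex:plays", "rdfs:domain", "ex:Person"),
}
closure = apply_rdfs_domain(g)
```

After the single pass, closure contains the inferred triple ("ex:Anna", "rdf:type", "ex:Person") alongside the two asserted ones; nothing beyond the RDFS rule itself is applied.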
Re: Inference for error checking [was Re: How to avoid that collections break relationships]
Well, certainly, one could do this if one wanted to. However, is this a useful thing to do, in general, particularly in the absence of constructs that actually sanction the inference, and particularly if the checking is done in a context where there is no way of actually getting the author to fix whatever problems are encountered? My feelings are that if you really want to do this, then the place to do it is during data entry or data importation. peter On 04/03/2014 03:12 PM, David Booth wrote: First of all, my sincere apologies to Pat, Peter and the rest of the readership for totally botching my last example, writing domain when I meant range *and* explaining it wrong. Sorry for all the confusion it caused! I was simply trying to demonstrate how a schema:domainIncludes assertion could be useful for error checking even if it had no formal entailments, by making selective use of the CWA. I'll try again. Suppose we are given these RDF statements, in which the author *may* have made a typo, writing ddd instead of ccc as the rdf:type of x: x ppp y . # Triple A x rdf:type ddd . # Triple B ppp schema:domainIncludes ccc . # Triple C As given, these statements are consistent, so a reasoner will not detect a problem. Indeed, they may or may not be what the author intended. If the author later added the statement: ccc owl:equivalentClass ddd . # Triple E then ddd probably was what the author intended in triple B. OTOH if the author later added: ccc owl:disjointWith ddd . # Triple F then ddd probably was not what the author intended in triple B. However, thus far we are only given triples {A,B,C} above, and an error checker wishes to check for *potential* typos by applying the rule: For all subgraphs of the form { x ppp y . ppp schema:domainIncludes ccc . } check whether { x rdf:type ccc . } is *provably* true. If not, then fail the error check. If all such subgraphs pass, then the error check as a whole passes. Under the OWA, the requirement: { x rdf:type ccc . 
} is neither provably true nor provably false given graph {A,B,C}. But under the CWA it is considered false, because it is not provably true. This is how the schema:domainIncludes can be useful for error checking even if it has no formal entailments: it tells the error checker which cases to check. I hope that now makes more sense. Again, sorry to have screwed up my example so badly last time, and I hope I've got it right this time. :) David On 04/02/2014 11:42 PM, Pat Hayes wrote: On Mar 31, 2014, at 10:31 AM, David Booth da...@dbooth.org wrote: On 03/30/2014 03:13 AM, Pat Hayes wrote: [...] What follows from knowing that ppp schema:domainIncludes ccc . ? Suppose you know this and you also know that x ppp y . Can you infer x rdf:type ccc? I presume not, since the domain might include other stuff outside ccc. So, what *can* be inferred about the relationship between x and ccc ? As far as I can see, nothing can be inferred. If I am wrong, please enlighten me. But if I am right, what possible utility is there in even making a schema:domainIncludes assertion? If inference is too strong, let me weaken my question: what possible utility **in any way whatsoever** is provided by knowing that schema:domainIncludes holds between ppp and ccc? What software can do what with this, that it could not do as well without this? I think I can answer this question quite easily, as I have seen it come up before in discussions of logic. ... Note that this categorization typically relies on making a closed world assumption (CWA), which is common for an application to make for a particular purpose -- especially error checking. Yes, of course. If you make the CWA with the information you have, then ppp schema:domainIncludes ccc . has exactly the same entailments as ppp rdfs:domain ccc . has in RDFS without the CWA. But that, of course, begs the question. 
If you are going to rely on the CWA, then (a) you are violating the basic assumptions of all Web notations and (b) you are using a fundamentally different semantics. And see below. None of this has anything to do with a distinction between entailment and error checking, by the way. Your hypothetical three-way classification task uses the same meanings of the RDF as any other entailment task would. In this example, let us suppose that to pass, the object of every predicate must be in the Known Domain of that predicate, where the Known Domain is the union of all declared schema:domainIncludes classes for that predicate. (Note the CWA here.) Given this error checking objective, if a system is given the facts: x ppp y . y a ccc . then without also knowing that ppp schema:domainIncludes ccc, the system may not be able to determine that these statements should be considered Passed or Failed: the result may be Indeterminate. But if the system is also told
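The checking rule under discussion in this thread, with its three-way Passed / Failed / Indeterminate outcome, can be sketched in plain Python. This is an illustrative reading, not schema.org tooling: KnownDomain(p) is taken as the union of p's declared schema:domainIncludes classes, and under the CWA "not provably in the known domain" counts as a failure.

```python
# A sketch of the CWA-based check: a use (x, p, y) Passes if some
# asserted type of x falls in KnownDomain(p), Fails otherwise (CWA:
# not provably true counts as false), and is Indeterminate when no
# schema:domainIncludes declarations exist for p at all.
def check_uses(graph):
    known_domain, types = {}, {}
    for (s, p, o) in graph:
        if p == "schema:domainIncludes":
            known_domain.setdefault(s, set()).add(o)
        elif p == "rdf:type":
            types.setdefault(s, set()).add(o)
    results = {}
    for (x, p, y) in graph:
        if p in ("schema:domainIncludes", "rdf:type"):
            continue  # schema-level and typing triples are not checked
        if p not in known_domain:
            results[(x, p, y)] = "Indeterminate"
        elif types.get(x, set()) & known_domain[p]:
            results[(x, p, y)] = "Passed"
        else:
            results[(x, p, y)] = "Failed"
    return results
```

On David's triples {A,B,C} the use of ppp Fails (ddd is not in KnownDomain(ppp) = {ccc}); adding x rdf:type ccc flips it to Passed; with no domainIncludes declaration at all it would be Indeterminate.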
Re: How to avoid that collections break relationships
On 03/31/2014 01:59 PM, Ruben Verborgh wrote: In actuality, defining things like owl:sameAs is indeed extending RDF. Defining things in terms of OWL connectives also goes beyond RDF. This is different from introducing domain predicates like foaf:friends. (Yes, it is sometimes a bit hard to figure out which side of the line one is on.) Thanks for clarifying, and this is indeed where we disagree. For me, such a line does not exist, nor was it ever defined. And even if there were, I don't see the need to draw it. RDF is the framework, the interpretation is semantics. All predicates have meaning associated with them, none has “more” meaning than the other; maybe some usually allow one to infer more triples, but that doesn't change the framework at all. Cheers, Ruben What then is RDF for you? For example, do you consider N3 to be RDF? Can the predicates be modal operators? Can predicates have non-local effects? What does using owl:differentFrom in RDF commit you to? To me, what RDF does not do is just as important as what it does do. This means that RDF captures only the RDF bit of the meaning of predicates - the rest of their meaning remains inaccessible from RDF. Any attempt to go beyond this is ... going beyond RDF and it is very important to realize this. peter
Re: Inference for error checking [was Re: How to avoid that collections break relationships]
On 03/31/2014 01:39 PM, David Booth wrote: On 03/31/2014 11:59 AM, Peter F. Patel-Schneider wrote: [...] Given this error checking objective, if a system is given the facts: x ppp y . y a ccc . then without also knowing that ppp schema:domainIncludes ccc, the system may not be able to determine that these statements should be considered Passed or Failed: the result may be Indeterminate. But if the system is also told that ppp schema:domainIncludes ccc . then it can safely categorize these statements as Passed (within the limits of this error checking). Sure, but it can be very tricky to determine just what facts to consider when making this determination, particularly with the upside-down nature of schema:domainIncludes. My assumption in this example is that the application already has a set of assertions that it intends to work with, and it wishes to error check them. Isn't it quite tricky to figure out what this set of assertions should be? For example, are consequences of other facts allowed? All of them? Thus, although schema:domainIncludes does not enable any new entailments under the open world assumption (OWA), it *does* enable some useful error checking inference under the closed world assumption (CWA), by enabling a shift from Indeterminate to Passed or Failed. The CWA actually works against you here. Given the following triples, x ppp y . # Triple A y rdf:type ddd . # Triple B ppp schema:domainIncludes ccc . # Triple C you are determining whether y rdf:type ccc . # Triple E is entailed, whether its negation is entailed, or neither. The relevant CWA would push these last two together, making it impossible to have a three-way determination, which you want. I don't think that's quite it. The error check that I described is not the same as checking whether NOT(y rdf:type ccc) is entailed. (Such a conclusion could be entailed if there were an owl:disjointWith assertion, for example.) It is checking whether (y rdf:type KnownDomain(ppp)). 
In other words, the CWA is not being made in testing whether (y rdf:type ccc); rather it is being made in computing KnownDomain(ppp). Huh? What is this KnownDomain construct? Where does it come from? How is it computed? The net effect of this is that the CWA is being used to distinguish between cases that would all be considered unknown under the OWA. I still don't see a role for the CWA here. David peter
Re: How to avoid that collections break relationships
I don't see how this solves the problem. Even if you augment RDF with this set construct (hydra:memberOf pointing to a separate document containing a collection of entities), what you are saying is that Markus knows some entity, and that that entity belongs to the set. It does not say which of the members of the set are friends of Markus, nor that any more than one of them has to be. peter On 03/31/2014 01:34 AM, Ruben Verborgh wrote: Dear all, Sorry for hijacking the discussion, but I think we should keep the discussion goal-focused. So let's therefore see what we want to achieve: 1. Having a way for clients to find out the members of a specific collection 2. Not breaking the RDF model while doing so A solution that satisfies 1 and 2 with minimal effort is good enough for Hydra; the rest can be discussed more deeply in other places. The easiest solution I could come up with that satisfies the above criteria is the following. Suppose a client needs to find Markus' friends, and the server uses foaf:knows for that (which has the restrictive range foaf:Person, disallowing a collection). If the representation contains all of Markus' friends, then it could look like: /people/markus foaf:knows /people/Anna. /people/markus foaf:knows /people/Bert. /people/markus foaf:knows /people/Carl. Now, more interestingly, if the list of Markus' friends is available as a separate resource /people/markus/friends, then it could look like: /people/markus foaf:knows [ hydra:memberOf /people/markus/friends ]. So we say that a blank node is one of Markus' friends, and where it can be found. This satisfies 1, because the client can follow the link and find all friends there. This satisfies 2, because the blank node is an actual person, not a collection. And that is all we need for hypermedia clients to work. 
Yes, I know this does not add a whole bunch of extensive semantics we might need for specific case X, but: a) that's not necessary in general; a Hydra client has all it needs; b) this solution is extensible to allow for that. If you like, you can add details about /people/markus/friends, say that they all have a foaf:knows relationship to /people/markus etc. Summarized: look at the minimum a client needs, implement that; the only thing we need is a blank node and a memberOf predicate. Hydra clients work; the model is happy too. Best, Ruben PS The only case this slightly breaks the model is if Markus has no friends yet; then you say Markus knows somebody while he actually doesn't. But saying something doesn't exist is a problem in RDF anyway. The easy way: just don't include any foaf:knows triple (or ignore slight breakage). If you insist on including _something_, we'd need to have an explicit empty list: /people/markus foaf:knowsList (). foaf:knowsList hydra:listPropertyOf foaf:knows. But then we'd be stuck with twice as many properties, which is not ideal either.
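The hypermedia-client behaviour Ruben describes can be sketched in plain Python. This is an illustrative sketch, not Hydra client code: the URLs are from his example, triples are (subject, predicate, object) tuples, and DOCUMENTS is a hypothetical stand-in for dereferencing a URL over HTTP.

```python
# Sketch of a client that, on seeing a blank-node friend carrying
# hydra:memberOf, follows the collection URL and reads the member
# triples there. DOCUMENTS stands in for HTTP GET.
DOCUMENTS = {
    "/people/markus": [
        ("/people/markus", "foaf:knows", "_:b0"),
        ("_:b0", "hydra:memberOf", "/people/markus/friends"),
    ],
    "/people/markus/friends": [
        ("/people/markus", "foaf:knows", "/people/Anna"),
        ("/people/markus", "foaf:knows", "/people/Bert"),
        ("/people/markus", "foaf:knows", "/people/Carl"),
    ],
}

def find_friends(person):
    triples = DOCUMENTS[person]
    friends = set()
    for (s, p, o) in triples:
        if s == person and p == "foaf:knows":
            # Does this friend point at a collection document?
            collection = next((obj for (subj, pred, obj) in triples
                               if subj == o and pred == "hydra:memberOf"),
                              None)
            if collection is not None:
                # Follow the link and collect the members listed there.
                friends.update(obj for (subj, pred, obj)
                               in DOCUMENTS[collection]
                               if subj == person and pred == "foaf:knows")
            else:
                friends.add(o)
    return friends
```

Calling find_friends("/people/markus") dereferences /people/markus/friends via the blank node and returns Anna, Bert, and Carl, which is the "all a hypermedia client needs" behaviour claimed in the message.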
Re: How to avoid that collections break relationships
On 03/31/2014 08:02 AM, Ruben Verborgh wrote: Hi Peter, I don't see how this solves the problem. Recall that “the problem” for Hydra clients is (in my opinion): 1. Having a way to find out the members of a specific collection 2. Not breaking the RDF model while doing so what you are saying is that Markus knows some entity, and that that entity belongs to the set. …and I give that set a URL, too. It does not say which of the members of the set are friends of Markus, nor that any more than one of them has to be …on purpose, since it is not needed at all to solve 1 and 2. /people/markus foaf:knows [ hydra:memberOf /people/markus/friends ]. Gives you the URL of the set, which allows you to retrieve it. Then if you retrieve it, it will list its members and say they are friends of Markus: /people/markus foaf:knows /people/Anna. /people/markus foaf:knows /people/Bert. /people/markus foaf:knows /people/Carl. But this is violating both the spirit and the letter of RDF. It would be better to introduce entirely new syntactic mechanisms, for example, something like /people/markus foaf:knows **http://.../people/markus/friends** which could be read as shorthand for replacing the **...** with an object list containing the objects in the document being pointed at. This is exactly the reason I propose this approach; I know many people want more semantics in there; and that's totally fine and even possible with this approach (like I said, you can further describe /people/markus/friends if you like.) But all we need for Hydra clients is a way to get Markus' friends. This solution offers it. No more semantics needed. Huh? If you want to be in the RDF camp, you have to play by RDF rules. However, maybe all you want is something that is syntactically RDF. In that case there are a multitude of solutions. Yours might be somewhat better than the other ones, but the differences look rather superficial when viewed through an RDF lens. Best, Ruben peter
Re: How to avoid that collections break relationships
On 03/31/2014 08:29 AM, Ruben Verborgh wrote: Hi Peter, This is why I started by saying the focus of the discussion should be on what we want to achieve. With my proposed solution, it is achieved. Furthermore, this solution allows you to add any metadata you might like; a Hydra client just wouldn't need it (even though others might). Right now, we don't need anything else than just finding the members of a collection. But this is violating both the spirit and the letter of RDF. It would be better to introduce entirely new syntactic mechanisms A new syntax would break everything that exists. How is that better? The proposed approach doesn't break anything and achieves what we need, without violating the RDF model. Huh? If you want to be in the RDF camp, you have to play by RDF rules. And we do that. /people/markus foaf:knows [ hydra:memberOf /people/markus/friends ]. means “Markus knows somebody who is a member of collection X.” But that's not what this says. It says that Markus knows some entity that is related by an unknown relationship to some unknown other entity. Check that collection X to find out if Markus knows more of them. I'm not saying there will be more in there… just saying that you could check it. Handy for a hypermedia client. Works in practice, doesn't break the model. If you want more semantics, just add them: /people/markus/friends :isACollectionOf [ :hasPredicate foaf:knows; :hasSubject /people/Markus ] But that is _not_ needed to achieve my 1 and 2. Well this certainly adds more triples. Whether it adds more meaning is a separate issue. Best, Ruben It appears that you feel that adding significant new expressive power is somehow less of a change than adding new syntax. I do not feel this way at all. peter
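Ruben's optional extra semantics (:isACollectionOf with :hasPredicate and :hasSubject) could, if a client chose to interpret them, be expanded mechanically into direct triples. A sketch under that assumption; the vocabulary terms come from his ad-hoc example, not from standard Hydra:

```python
# Sketch of expanding Ruben's optional collection description: given
# that a collection :isACollectionOf [ :hasPredicate P; :hasSubject S ],
# each member M of the collection yields the direct triple (S, P, M).
def expand_collection(description, members):
    subject = description["hasSubject"]
    predicate = description["hasPredicate"]
    return [(subject, predicate, m) for m in members]

desc = {"hasSubject": "/people/markus", "hasPredicate": "foaf:knows"}
triples = expand_collection(desc, ["/people/Anna", "/people/Bert"])
```

Whether the expanded triples add meaning beyond the original description is exactly the point Peter questions; the expansion itself is purely mechanical.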
Re: Inference for error checking [was Re: How to avoid that collections break relationships]
On 03/31/2014 08:31 AM, David Booth wrote: On 03/30/2014 03:13 AM, Pat Hayes wrote: [...] What follows from knowing that ppp schema:domainIncludes ccc . ? Suppose you know this and you also know that x ppp y . Can you infer x rdf:type ccc? I presume not, since the domain might include other stuff outside ccc. So, what *can* be inferred about the relationship between x and ccc ? As far as I can see, nothing can be inferred. If I am wrong, please enlighten me. But if I am right, what possible utility is there in even making a schema:domainIncludes assertion? If inference is too strong, let me weaken my question: what possible utility **in any way whatsoever** is provided by knowing that schema:domainIncludes holds between ppp and ccc? What software can do what with this, that it could not do as well without this? I think I can answer this question quite easily, as I have seen it come up before in discussions of logic. Entailment produces statements that are known to be true, given a set of facts and entailment rules. And indeed, adding the fact that ppp schema:domainIncludes ccc . to a set of facts produces no new entailments in that sense. Is it then your contention that schema:domainIncludes does not add any new entailments under the schema.org semantics? But it *does* enable another kind of very useful machine-processable inference that is useful in error checking, which I'll describe. In error checking, it is sometimes useful to classify a set of statements into three categories: Passed, Failed or Indeterminate. Passed means that the statements are fine (within the checkable limits anyway): sufficient information has been provided, and it is internally consistent. Failed means that there is something malformed about them (according to the application's purpose). 
Indeterminate means that the system does not have enough information to know whether the statements are okay or not: further work might need to be performed, such as manual examination or adding more information (facts) to the system. Hence, it is *useful* to be able to quickly and automatically establish that the statements fall into the Passed or Failed category. Note that this categorization typically relies on making a closed world assumption (CWA), which is common for an application to make for a particular purpose -- especially error checking. I don't see that the CWA is particularly germane here, except that most formalisms that do this sort of checking also utilize some sort of CWA. There is nothing wrong with performing this sort of analysis in formalisms that do not have any form of CWA. What does cause problems with this sort of analysis is the presence of non-trivial inference. In this example, let us suppose that to pass, the object of every predicate must be in the Known Domain of that predicate, where the Known Domain is the union of all declared schema:domainIncludes classes for that predicate. (Note the CWA here.) Given this error checking objective, if a system is given the facts: x ppp y . y a ccc . then without also knowing that ppp schema:domainIncludes ccc, the system may not be able to determine that these statements should be considered Passed or Failed: the result may be Indeterminate. But if the system is also told that ppp schema:domainIncludes ccc . then it can safely categorize these statements as Passed (within the limits of this error checking).
Sure, but it can be very tricky to determine just what facts to consider when making this determination, particularly with the upside-down nature of schema:domainIncludes. Thus, although schema:domainIncludes does not enable any new entailments under the open world assumption (OWA), it *does* enable some useful error checking inference under the closed world assumption (CWA), by enabling a shift from Indeterminate to Passed or Failed. The CWA actually works against you here. Given the following triples, x ppp y . y rdf:type ddd . ppp schema:domainIncludes ccc . you are determining whether y rdf:type ccc . is entailed, whether its negation is entailed, or neither. The relevant CWA would push these last two together, making it impossible to have a three-way determination, which you want. If anyone is concerned that this use of the CWA violates the spirit of RDF, which indeed is based on the OWA (for *very* good reason), please bear in mind that almost every application makes the CWA at some point, to do its job. David peter
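David's Passed/Failed/Indeterminate scheme can be sketched as a toy checker in plain Python. This is only an illustration of the idea, not anything from the thread itself: the predicate and class names (ppp, ccc, ddd) are the thread's placeholders, and, following David's description, the checker tests the *object* of each triple against the Known Domain, which is the upside-down reading Peter alludes to.

```python
# Toy three-way error checker in the spirit of the scheme described above.
# Predicate/class names (ppp, ccc, ddd) are the thread's placeholders.

# Known Domain: the union of all declared schema:domainIncludes classes
# for each predicate (closed-world: this map is assumed complete).
domain_includes = {
    "ppp": {"ccc"},
}

# rdf:type facts (closed-world: these are all the types we know about).
types = {
    "y": {"ccc"},   # y a ccc .
    "z": {"ddd"},   # z a ddd .
}

def check(subject, predicate, obj):
    """Classify the triple (subject, predicate, obj) as Passed, Failed,
    or Indeterminate against the Known Domain of the predicate."""
    known_domain = domain_includes.get(predicate)
    if known_domain is None:
        return "Indeterminate"          # no domainIncludes declared at all
    node_types = types.get(obj, set())
    if node_types & known_domain:
        return "Passed"                 # a known type is in the Known Domain
    if node_types:
        return "Failed"                 # typed, but outside it (CWA)
    return "Indeterminate"              # untyped node: not enough information

print(check("x", "ppp", "y"))           # Passed
print(check("x", "ppp", "z"))           # Failed
print(check("x", "ppp", "unknown"))     # Indeterminate
```

Note Peter's objection applies directly: the "Failed" branch only exists because the types map is assumed complete, i.e. it collapses "not entailed" and "negation entailed" into one case.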
Re: How to avoid that collections break relationships
On 03/31/2014 08:48 AM, Ruben Verborgh wrote: /people/markus foaf:knows [ hydra:memberOf /people/markus/friends ]. means “Markus knows somebody who is a member of collection X.” But that's not what this says. It says that Markus knows some entity that is related by an unknown relationship to some unknown other entity. Well, obviously we'd have to define the hydra:memberOf predicate… It's not helpful to interpret foaf:knows as “knows” but hydra:memberOf as “unknown relationship”. In this case it certainly is. You want to depend on a particular reading of this non-RDF predicate, and have this reading trigger inferences. A system that only uses the RDF semantics will have no knowledge of this extra semantics and will thus not perform these essential inferences. Yes, when one is being totally formal one should not change from foaf:knows to knows, but there is no formal fallout from this shadiness. And “unknown entity” is intended; this is why you have to fetch it if you're curious. If you want more semantics, just add them: /people/markus/friends :isACollectionOf [ :hasPredicate foaf:knows; :hasSubject /people/Markus ] But that is _not_ needed to achieve my 1 and 2. Well this certainly adds more triples. Whether it adds more meaning is a separate issue. Obviously, we'd define isACollectionOf as well. Again making a significant addition to RDF. It appears that you feel that adding significant new expressive power is somehow less of a change than adding new syntax. I'm not adding any new expressive power. Can you point exactly to where you think I'm doing that? Yes, I define a memberOf predicate that clients have to understand. That's new expressive power. But that's a given if we just define it as owl:inverseProperty hydra:member. Which is precisely my point. You are using OWL, not just RDF. If you want to do this in a way that fits in better with RDF, it would be better to add to the syntax of RDF without adding to the semantics of RDF. Best, Ruben peter
Re: How to avoid that collections break relationships
If you want a hydra solution, then you should do whatever is needed to make it a hydra solution. In actuality, defining things like owl:sameAs is indeed extending RDF. Defining things in terms of OWL connectives also goes beyond RDF. This is different from introducing domain predicates like foaf:friends. (Yes, it is sometimes a bit hard to figure out which side of the line one is on.) peter On 03/31/2014 09:26 AM, Ruben Verborgh wrote: Peter, Please, let's get the discussion back to what we want to achieve in the first place. Right now, the solution is being evaluated on a dozen other things that are not relevant. Proposal: let's discuss the whole abstract RDF container thing on public-lod@w3.org, and solutions to make clients work at public-hy...@w3.org. We're talking here about making clients able to get the members of something. Yes, they will need to interpret some properties. Just like an OWL reasoner needs to interpret owl:sameAs, a Hydra client needs to interpret hydra:member. That is how applications work. In no way is defining a vocabulary extending RDF. RDF is a framework. I'm not adding to the framework. I'm proposing a simple property hydra:memberOf owl:inverseProperty hydra:member. If you really don't like me introducing a property, here's an alternative way of saying the same thing: /people/markus foaf:knows _:x. /people/markus/friends hydra:member _:x. There you go. hydra:member was already defined, I'm not inventing or adding anything. You want to depend on a particular reading of this non-RDF predicate, and have this reading trigger inferences. No I don't want any of that. Why do you think I'd want that? Where did I say I want inferences? Where do I need them? Also, how could it possibly be a non-RDF predicate? RDF simply defines a predicate as an IRI [1]. Again making a significant addition to RDF. When did defining a vocabulary become adding to RDF? Which is precisely my point. You are using OWL, not just RDF.
If you want to do this in a way that fits in better with RDF, it would be better to add to the syntax of RDF without adding to the semantics of RDF. …but this has _never_ been about extending RDF in any way, nor has it been about only using RDF or only using OWL. We don't want any of that. We want: 1. Having a way for clients to find out the members of a specific collection 2. Not breaking the RDF model while doing so The proposed solution achieves both objectives. Best, Ruben [1] http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#dfn-predicate
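For concreteness, the two formulations Ruben contrasts can be written down as triple sets, with the inverse-property rewriting that an OWL reasoner would perform made explicit. A small sketch in Python (everything here is illustrative; note also that the actual OWL term for declaring an inverse is owl:inverseOf, not the owl:inverseProperty the thread mentions):

```python
# Sketch (not normative): Ruben's two ways of saying "Markus knows
# someone who is in the collection /people/markus/friends".

# Formulation 1: via the proposed inverse property hydra:memberOf.
form1 = {
    ("/people/markus", "foaf:knows", "_:x"),
    ("_:x", "hydra:memberOf", "/people/markus/friends"),
}

# Formulation 2: only the existing hydra:member property, inverted by hand.
form2 = {
    ("/people/markus", "foaf:knows", "_:x"),
    ("/people/markus/friends", "hydra:member", "_:x"),
}

def expand_inverse(facts, prop, inverse):
    """Rewrite prop into its inverse: the step an OWL reasoner would take
    given the declaration `prop owl:inverseOf inverse`."""
    return {((o, inverse, s) if p == prop else (s, p, o))
            for (s, p, o) in facts}

# Applying the inverse declaration makes the two formulations coincide,
# which is Peter's point: the equivalence itself lives in OWL, not RDF.
print(expand_inverse(form1, "hydra:memberOf", "hydra:member") == form2)  # True
```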
Re: How to avoid that collections break relationships
On 03/30/2014 12:13 AM, Pat Hayes wrote: On Mar 29, 2014, at 8:10 PM, Peter F. Patel-Schneider pfpschnei...@gmail.com wrote: On 03/29/2014 03:30 PM, Markus Lanthaler wrote: On Wednesday, March 26, 2014 5:26 AM, Pat Hayes wrote: Hmm. I would be inclined to violate IRI opacity at this point and have a convention that says that any schema.org property schema:ppp can have a sister property called schema:pppList, for any character string ppp. So you ought to check schema:knowsList when you are asked to look for schema:knows. Then although there isn't a link in the conventional sense, there is a computable route from schema:knows to schema:knowsList, which as far as I am concerned amounts to a link. Schema.org doesn't suffer from this issue as much as other vocabularies do as it isn't defined with RDFS but uses its own, looser description mechanisms such as schema:domainIncludes and schema:rangeIncludes. So what I'm really looking for is a solution that would work in general, not just for some vocabularies. [...] -- Markus Lanthaler @markuslanthaler I would like to see some firm definition of just how these looser description mechanisms actually work. Yes, I agree. Let me put the question rather more sharply. What follows from knowing that ppp schema:domainIncludes ccc . ? Suppose you know this and you also know that x ppp y . Can you infer x rdf:type ccc? I presume not, since the domain might include other stuff outside ccc. So, what *can* be inferred about the relationship between x and ccc ? As far as I can see, nothing can be inferred. If I am wrong, please enlighten me. But if I am right, what possible utility is there in even making a schema:domainIncludes assertion? If inference is too strong, let me weaken my question: what possible utility **in any way whatsoever** is provided by knowing that schema:domainIncludes holds between ppp and ccc? What software can do what with this, that it could not do as well without this? 
Having a piece of formalism which claims to be a 'weak' assertion becomes simply ludicrous when it is so weak that it carries no content at all. This bears the same relation to axiom writing that miming does to wrestling. Pat Perhaps this could be sharpened somewhat: the relation that professional wrestling bears to wrestling. peter
Re: How to avoid that collections break relationships
On 03/29/2014 03:30 PM, Markus Lanthaler wrote: On Wednesday, March 26, 2014 5:26 AM, Pat Hayes wrote: Hmm. I would be inclined to violate IRI opacity at this point and have a convention that says that any schema.org property schema:ppp can have a sister property called schema:pppList, for any character string ppp. So you ought to check schema:knowsList when you are asked to look for schema:knows. Then although there isn't a link in the conventional sense, there is a computable route from schema:knows to schema:knowsList, which as far as I am concerned amounts to a link. Schema.org doesn't suffer from this issue as much as other vocabularies do as it isn't defined with RDFS but uses its own, looser description mechanisms such as schema:domainIncludes and schema:rangeIncludes. So what I'm really looking for is a solution that would work in general, not just for some vocabularies. [...] -- Markus Lanthaler @markuslanthaler I would like to see some firm definition of just how these looser description mechanisms actually work. peter
Re: How to avoid that collections break relationships
Let's see if I have this right. You are encountering a situation where the number of people Markus knows is too big (somehow). The proposed solution is to move this information to a separate location. I don't see how this helps in reducing the size of the information, which was the initial problem. Splitting this information into pieces might help. schema.org, along with just about every other RDF syntax, does not require that all the information about a particular entity is in the same spot. The problem then is to ensure that all the information is accessed together. schema.org, somewhat separate from other RDF syntaxes, does have facilities for this. All you need to do is to set up multiple pages, for example .../markus1 through .../markusn, and on each of these pages include schema.org markup with content like .../markusi schema:url .../markus .../markus schema:knows .../friendi1 ... .../markus schema:knows .../friendimi Then on .../markus you have .../markus schema:url .../markus1 ... .../markus schema:url .../markusn (Maybe schema:sameAs is a better relationship to use here, but they both should work.) Voila! (With the big proviso that I have no idea whether the schema.org processors actually do the right thing here, as there is no indication of what they do do.) peter PS: LDP?? On 03/24/2014 08:24 AM, Markus Lanthaler wrote: Hi all, We have an interesting discussion in the Hydra W3C Community Group [1] regarding collections and would like to hear more opinions and ideas. I'm sure this is an issue a lot of Linked Data applications face in practice. Let's assume we want to build a Web API that exposes information about persons and their friends. Using schema.org, your data would look somewhat like this: /markus a schema:Person ; schema:knows /alice ; ... schema:knows /zorro . All this information would be available in the document at /markus (please let's not talk about hash URLs etc. here, ok?).
Depending on the number of friends, the document however may grow too large. Web APIs typically solve that by introducing an intermediary (paged) resource such as /markus/friends/. In Schema.org we have ItemList to do so: /markus a schema:Person ; schema:knows /markus/friends/ . /markus/friends/ a schema:ItemList ; schema:itemListElement /alice ; ... schema:itemListElement /zorro . This works, but has two problems: 1) it breaks the /markus --[knows]-- /alice relationship 2) it says that /markus --[knows]-- /markus/friends While 1) can easily be fixed, 2) is much trickier--especially if we consider cases that don't use schema.org with its weak semantics but a vocabulary that uses rdfs:range, such as FOAF. In that case, the statement /markus foaf:knows /markus/friends/ . and the fact that foaf:knows rdfs:range foaf:Person . would yield the wrong inference that /markus/friends is a foaf:Person. How do you deal with such cases? How is schema.org intended to be used in cases like these? Is the above use of ItemList sensible or is this something that should better be avoided? Thanks, Markus P.S.: I'm aware of how LDP handles this issue, but, while I generally like the approach it takes, I don't like the fact that it imposes a specific interaction model. [1] http://bit.ly/HydraCG -- Markus Lanthaler @markuslanthaler
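The unwanted inference Markus describes can be reproduced with a few lines of plain Python: a toy forward application of the rdfs:range rule, with triples modeled as tuples. This is an illustration only; a real RDFS reasoner would reach the same conclusion on this input.

```python
# Toy illustration of the rdfs:range entailment rule:
#   if (p rdfs:range c) and (s p o) then (o rdf:type c).
triples = {
    ("/markus", "foaf:knows", "/markus/friends"),
    ("foaf:knows", "rdfs:range", "foaf:Person"),
}

def apply_range_rule(facts):
    """One round of the rdfs:range rule; returns facts plus entailments."""
    ranges = {s: o for (s, p, o) in facts if p == "rdfs:range"}
    inferred = {(o, "rdf:type", ranges[p])
                for (s, p, o) in facts if p in ranges}
    return facts | inferred

entailed = apply_range_rule(triples)
# The unwanted conclusion: the collection itself is inferred to be a person.
print(("/markus/friends", "rdf:type", "foaf:Person") in entailed)  # True
```

This is exactly why the problem is harder with FOAF than with schema.org: schema:rangeIncludes licenses no such entailment, so pointing schema:knows at an ItemList is merely odd, not contradictory.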
Re: How to avoid that collections break relationships
On 03/25/2014 10:40 AM, Markus Lanthaler wrote: On Tuesday, March 25, 2014 5:49 PM, Peter F. Patel-Schneider wrote: Let's see if I have this right. You are encountering a situation where the number of people Markus knows is too big (somehow). The proposed solution is to move this information to a separate location. I don't see how this helps in reducing the size of the information, which was the initial problem. Cynical as usual :-) Let's just assume that the vast majority of the clients aren't interested in Markus' friends but just in information about him. Thus, they shouldn't have to process megabytes of friend relationships that they are going to ignore anyway. Those few clients that are interested in those relationships, however, need a mechanism to find them. Aah. However, this is a new requirement. So what you want is to be able to cherry-pick the data associated with Markus, and not even have to pay for transmitting the unwanted bits. This is definitely not supported by schema.org. To do this in general would require specifying the data that you want in the request. Splitting this information into pieces might help. schema.org, along with just about every other RDF syntax, does not require that all the information about a particular entity is in the same spot. The problem then is to ensure that all the information is accessed together. schema.org, somewhat separate from other RDF syntaxes, does have facilities for this. All you need to do is to set up multiple pages, for example .../markus1 through .../markusn, and on each of these pages include schema.org markup with content like .../markusi schema:url .../markus I'm still wondering what schema:url is actually for and how it relates to Microdata's itemid, RDFa's resource and JSON-LD's @id... but that's a separate discussion. Yeah. I'm still waiting for the better documentation that was supposed to be coming shortly after ISWC 2013. .../markus schema:knows .../friendi1 ...
.../markus schema:knows .../friendimi Then on .../markus you have .../markus schema:url .../markus1 ... .../markus schema:url .../markusn (Maybe schema:sameAs is a better relationship to use here, but they both should work.) Yeah, this would of course work, but it doesn't tell the client at all why it should follow schema:url links to /markus{n}. The same is more or less true about schema:sameAs. -- Markus Lanthaler @markuslanthaler Voila! (With the big proviso that I have no idea whether the schema.org processors actually do the right thing here, as there is no indication of what they do do.) peter PS: LDP?? Linked Data Platform: http://www.w3.org/TR/ldp/
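Peter's page-splitting scheme can be simulated in a few lines of Python. The URIs and the two-page split below are purely illustrative, and, per Peter's own proviso, this says nothing about what actual schema.org processors do; it only shows the merge a client would have to perform by following the schema:url links.

```python
# Toy simulation of the multi-page scheme sketched above. Each "page"
# is a set of triples; URIs and the two-page split are illustrative.
pages = {
    "/markus": {("/markus", "schema:url", "/markus1"),
                ("/markus", "schema:url", "/markus2")},
    "/markus1": {("/markus1", "schema:url", "/markus"),
                 ("/markus", "schema:knows", "/alice")},
    "/markus2": {("/markus2", "schema:url", "/markus"),
                 ("/markus", "schema:knows", "/zorro")},
}

def gather(entity):
    """Follow schema:url links from the entity's main page and merge
    every schema:knows statement found on the subsidiary pages."""
    friends = set()
    for (s, p, o) in pages[entity]:
        if s == entity and p == "schema:url":
            for (s2, p2, o2) in pages.get(o, set()):
                if s2 == entity and p2 == "schema:knows":
                    friends.add(o2)
    return friends

print(sorted(gather("/markus")))  # ['/alice', '/zorro']
```

Markus's objection is visible here too: nothing in the data tells a generic client that it *should* run this traversal, i.e. that the schema:url links are where the friend list lives.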
Re: How to avoid that collections break relationships
On 03/24/2014 08:24 AM, Markus Lanthaler wrote: Hi all, [snip] Thanks, Markus P.S.: I'm aware of how LDP handles this issue, but, while I generally like the approach it takes, I don't like the fact that it imposes a specific interaction model. So LDP handles this issue, and is going through the W3C process. Why not hold your nose and use it? Or even better, participate and fix it? peter PS: It took me quite a while to figure out just what LDP was trying to do, and how it was proposing to do it. That's probably a sign that more user-facing documentation is needed.
Re: How to avoid that collections break relationships
On 03/25/2014 11:21 AM, Markus Lanthaler wrote: On Tuesday, March 25, 2014 7:04 PM, Peter F. Patel-Schneider wrote: On 03/24/2014 08:24 AM, Markus Lanthaler wrote: Hi all, [snip] Thanks, Markus P.S.: I'm aware of how LDP handles this issue, but, while I generally like the approach it takes, I don't like the fact that it imposes a specific interaction model. So LDP handles this issue, and is going through the W3C process. Why not hold your nose and use it? Or even better, participate and fix it? Because in my opinion the model LDP is based on is doomed to fail. I expressed that concern a couple of times (at conferences where LDP was presented). If I find the time (which will be tricky as I'm traveling from tomorrow onwards), I may even do a full review of the spec. Probably the best thing to do next is to elucidate all of what you do want, as it appears to be quite complex, involving not just the amount of data being served, but also how to access specific parts of the data under the control of the requester. Otherwise, you are going to continue to get solutions that address only what you are asking about, which do not appear to be meeting your needs. peter