Re: [Wiki-research-l] [Foundation-l] WikiCite - new WMF project? Was: UPEI's proposal for a universal citation index
On Tue, Jul 20, 2010 at 9:26 PM, Brian J Mingus brian.min...@colorado.edu wrote:

> I like your suggestion that the abc disambiguator be chosen based on the first date of publication, and I also like the prospect of using slashes since they can't be contained in names. Using the full year is a good idea too. We can combine these to come up with a key that, in principle, is guaranteed to be unique. This key would contain:
>
> 1) The first three author names separated by slashes

Why not separate by pluses? They don't form part of names either, and don't cause problems with wiki page titles.

> 2) If there are more than three authors, an EtAl

I don't think that's necessary if we get the abc part right.

> 3) Some or all of the date. For instance, if there is only one source by this set of authors that year, we can just use YYYY. However, once another source by that set of authors is added, the key should change to YYYYMMDD or similar.

I don't think it is a good idea to change one key as a function of updates on another, except for a generic disambiguation tag.

> If there are multiple publications on the same day, we can resort to abc. Redirects and disambiguation pages can be set up when a key changes.

As Jodi pointed out already, the exact date is often not clearly identifiable, so I would go simply for the year. Instead of an alphabetic abc, one could use some function of the article title (e.g. the first three words thereof, or the initials of the first three words), always in lower case. An even less ambiguous abc would be the starting page (for printed stuff) or the article number (for online only), but this brings us back to the 7523225 problem you mentioned above.

> Since the slashes are somewhat cumbersome, perhaps we can not make them mandatory, but similarly use them only when they are necessary in order to escape a name. In the case that one of the authors does not have a slash in their name - the dominant case - we can stick to the easily legible and nicely compact CamelCase format.
>
> Example keys generated by this algorithm:
>
> KangHsuKrajbichEtAl2009

Kang+Hsu+Krajbich+2009+the+wick+in
or
Kang+Hsu+Krajbich+2009+twi

Also note that the CamelCase key does not yield results in a Google search, whereas the first plused variant brings up the right work correctly, while the plused one with the initialed title tends to bring up at least something written by or cited from these authors.

> Author1Author2/Author-Three/2009

Author1+Author2+Author-Three+2009+just+another+article
or
Author1+Author2+Author-Three+2009+jat

Of course, it does not have to be _exactly_ three authors, nor three words from the title, and it does not solve the John Smith (or Zheng Wang) problem.

Daniel
--
http://www.google.com/profiles/daniel.mietchen
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Re: [Wiki-research-l] [Foundation-l] WikiCite - new WMF project? Was: UPEI's proposal for a universal citation index
>> 1) The first three author names separated by slashes
>
> why not separate by pluses? they don't form part of names either, and don't cause problems with wiki page titles.

I like this... however, how would you represent this in a URL? Also note that using plusses in page names doesn't work with all server configurations, since plus has a special meaning in URLs.

>> 3) Some or all of the date. For instance, if there is only one source by this set of authors that year, we can just use YYYY. However, once another source by that set of authors is added, the key should change to YYYYMMDD or similar.
>
> I don't think it is a good idea to change one key as a function of updates on another, except for a generic disambiguation tag.

I agree. And if you *have* to use the full date, use YYYYMMDD, not the other way around, please.

>> Since the slashes are somewhat cumbersome, perhaps we can not make them mandatory, but similarly use them only when they are necessary in order to escape a name. In the case that one of the authors does not have a slash in their name - the dominant case - we can stick to the easily legible and nicely compact CamelCase format.
>>
>> Example keys generated by this algorithm:
>>
>> KangHsuKrajbichEtAl2009
>
> Kang+Hsu+Krajbich+2009+the+wick+in
> or
> Kang+Hsu+Krajbich+2009+twi

Both seem good, though I would suggest forming a convention to ignore any leading "the" and "a", to get a more distinctive three-word suffix.

> Of course, it does not have to be _exactly_ three authors, nor three words from the title, and it does not solve the John Smith (or Zheng Wang) problem.

It also doesn't solve issues with transliteration: Merik Möller may become Moeller or Moller, Jakob Voß may become Voss or Vosz or even VoB, etc. In the case of Chinese names, it's often not easy to decide which part is the last name. To avoid this kind of ambiguity, I suggest automatically applying some type of normalization and/or hashing.

There is quite a bit of research about this kind of normalisation out there, generally with the aim of detecting duplicates. Perhaps we can learn from bibsonomy.org; have a look at how they do it: http://www.bibsonomy.org/help/doc/inside.html. Gotta love open source university research projects :)

-- daniel
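As a rough illustration of the normalization and hashing suggested above, the following Python sketch folds case, strips diacritics, and hashes the result. It is not how bibsonomy.org does it; the function names, the choice of SHA-1, and the inclusion of "an" among the ignored leading articles are assumptions made for this example.

```python
import hashlib
import unicodedata

# Leading words to skip in a title suffix; "an" is an added assumption.
LEADING_ARTICLES = {"the", "a", "an"}


def normalize_name(name):
    """Case-fold (which maps 'ß' to 'ss') and drop combining marks such as umlauts."""
    folded = name.casefold()
    decomposed = unicodedata.normalize("NFKD", folded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def title_suffix(title, max_words=3):
    """Return up to max_words lower-cased title words, ignoring leading articles."""
    words = [w.lower() for w in title.split()]
    while words and words[0] in LEADING_ARTICLES:
        words.pop(0)
    return words[:max_words]


def name_hash(name):
    """Hash of the normalized name, so spelling variants that normalize alike collide."""
    return hashlib.sha1(normalize_name(name).encode("utf-8")).hexdigest()


print(normalize_name("Möller"), normalize_name("Voß"))
print(title_suffix("The wick in"))
print(name_hash("Möller") == name_hash("Moller"))
```

This simple folding maps Möller and Moller to the same string but still treats Moeller as different, and it does nothing for the question of which part of a name is the surname, so it covers only part of the problem raised in the message.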
[Wiki-research-l] page numbers
Jeff makes some good points about page numbers on public-lld (where I had forwarded part of this conversation).

-Jodi

Begin forwarded message:

Resent-From: public-...@w3.org
From: Young,Jeff (OR) jyo...@oclc.org
Date: 20 July 2010 22:53:40 GMT+01:00
To: Tom Morris tfmor...@gmail.com
Cc: Karen Coyle kco...@kcoyle.net, Jodi Schneider jschnei...@pobox.com, public-lld public-...@w3.org, Code for Libraries code4...@listserv.nd.edu, Brian Mingus brian.min...@colorado.edu
Subject: RE: universal citation index

I suspect this discussion happened on code4lib before the thread got cross-posted to LLD XG, where I first saw it. There are undoubtedly a ton of diverse use cases, but that doesn't mean APIs are the best solution. Here are some spitball possibilities for "not just manifestations" and "we need page numbers":

http://example.org/frbr:serial/2/citation-apa.{bcp-47}.txt
http://example.org/frbr:manifestation/1/citation-apa.{bcp-47}.txt?xyz:startPage=5&xyz:endPage=6

I'm imagining an xyz ontology with startPage and endPage, but we can surely create it if something doesn't already exist.

Jeff

-----Original Message-----
From: Tom Morris [mailto:tfmor...@gmail.com]
Sent: Tuesday, July 20, 2010 5:37 PM
To: Young,Jeff (OR)
Cc: Karen Coyle; Jodi Schneider; public-lld; Code for Libraries; Brian Mingus
Subject: Re: universal citation index

On Tue, Jul 20, 2010 at 1:40 PM, Young,Jeff (OR) jyo...@oclc.org wrote:

> In terms of Linked Data, it should make sense to treat citations as text/plain variant representations of a FRBR Manifestation.

As Karen mentioned, many types of citation need more information than just the manifestation. You also need page numbers, etc.

Tom
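To make Jeff's URL pattern concrete, here is a small Python sketch that fills in the template for a given citation style, language tag and optional page range. The `citation_url` function and its parameter names are hypothetical, and the `xyz:` prefix stands for the ontology that Jeff says he is only imagining.

```python
from urllib.parse import urlencode


def citation_url(base, entity, entity_id, style, lang, **page_params):
    """Build a citation URL following the pattern spitballed in the message above."""
    path = f"{base}/frbr:{entity}/{entity_id}/citation-{style}.{lang}.txt"
    if not page_params:
        return path
    # Keep ':' unescaped so the prefixed parameter names stay readable.
    query = urlencode({f"xyz:{k}": v for k, v in page_params.items()}, safe=":")
    return f"{path}?{query}"


print(citation_url("http://example.org", "serial", 2, "apa", "en"))
print(citation_url("http://example.org", "manifestation", 1, "apa", "en",
                   startPage=5, endPage=6))
```

The language tag `en` is just one possible BCP 47 value substituted for the `{bcp-47}` placeholder.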
Re: [Wiki-research-l] [Foundation-l] WikiCite - new WMF project? Was: UPEI's proposal for a universal citation index
On Wed, Jul 21, 2010 at 10:42 AM, Daniel Kinzler dan...@brightbyte.de wrote:

>>> 1) The first three author names separated by slashes
>>
>> why not separate by pluses? they don't form part of names either, and don't cause problems with wiki page titles.
>
> I like this... however, how would you represent this in a URL?

%2B would seem to be the obvious choice to me.

> Also note that using plusses in page names doesn't work with all server configurations, since plus has a special meaning in URLs.

I don't know enough about the double escaping business to comment on that, but if pluses are not acceptable, we still have equal signs (possibly with similar problems, but still useful for direct web search) and underscores (which would turn the whole key into one string for search engines).

Daniel
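The concern about pluses in URLs can be seen with Python's standard `urllib.parse` module. This is a minimal sketch; the key is one of the example keys from the thread, and how a particular wiki or web server actually decodes page names is not something the code can show.

```python
from urllib.parse import quote, unquote, unquote_plus

key = "Kang+Hsu+Krajbich+2009+twi"

# Percent-encoding turns each '+' into %2B, as suggested above.
encoded = quote(key, safe="")
print(encoded)

# Decoding the percent-encoded form restores the original key.
print(unquote(encoded) == key)

# A server that applies form-style decoding to the raw key reads '+' as a space.
print(unquote_plus(key))
```

The last line is the failure mode behind the "special meaning" remark: software that applies form decoding to an unescaped key would look up a page name with spaces instead of pluses.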
Re: [Wiki-research-l] [Foundation-l] WikiCite - new WMF project? Was: UPEI's proposal for a universal citation index
On Wed, Jul 21, 2010 at 2:42 AM, Daniel Kinzler dan...@brightbyte.de wrote:

>>> 1) The first three author names separated by slashes
>>
>> why not separate by pluses? they don't form part of names either, and don't cause problems with wiki page titles.
>
> I like this... however, how would you represent this in a URL? Also note that using plusses in page names doesn't work with all server configurations, since plus has a special meaning in URLs.
>
>>> 3) Some or all of the date. For instance, if there is only one source by this set of authors that year, we can just use YYYY. However, once another source by that set of authors is added, the key should change to YYYYMMDD or similar.
>>
>> I don't think it is a good idea to change one key as a function of updates on another, except for a generic disambiguation tag.
>
> I agree. And if you *have* to use the full date, use YYYYMMDD, not the other way around, please.
>
>>> Since the slashes are somewhat cumbersome, perhaps we can not make them mandatory, but similarly use them only when they are necessary in order to escape a name. In the case that one of the authors does not have a slash in their name - the dominant case - we can stick to the easily legible and nicely compact CamelCase format.
>>>
>>> Example keys generated by this algorithm:
>>>
>>> KangHsuKrajbichEtAl2009
>>
>> Kang+Hsu+Krajbich+2009+the+wick+in
>> or
>> Kang+Hsu+Krajbich+2009+twi
>
> Both seem good, though I would suggest forming a convention to ignore any leading "the" and "a", to get a more distinctive three-word suffix.
>
>> Of course, it does not have to be _exactly_ three authors, nor three words from the title, and it does not solve the John Smith (or Zheng Wang) problem.
>
> It also doesn't solve issues with transliteration: Merik Möller may become Moeller or Moller, Jakob Voß may become Voss or Vosz or even VoB, etc. In the case of Chinese names, it's often not easy to decide which part is the last name. To avoid this kind of ambiguity, I suggest automatically applying some type of normalization and/or hashing.
>
> There is quite a bit of research about this kind of normalisation out there, generally with the aim of detecting duplicates. Perhaps we can learn from bibsonomy.org; have a look at how they do it: http://www.bibsonomy.org/help/doc/inside.html. Gotta love open source university research projects :)
>
> -- daniel

Hey Daniel,

Bibsonomy seems to suffer from the same problem as CiteULike: URLs which convey no meaning. An example URL id from CiteULike is 2434335, and one from Bibsonomy is 29be860f0bdea4a29fba38ef9e6dd6a09. I hope to continue to steer the conversation away from that direction. These IDs guarantee uniqueness, but I believe that we can create keys that both guarantee uniqueness and convey some meaning to humans. Consider that this key will be embedded in wiki articles any time a source is cited. It's important that it make some sense.

Plus signs and slashes in the key appear to be cumbersome. Perhaps we can avoid this by truncating last names that involve a slash to either the portion before or after the slash.

Changing the key seems to be a bad idea, so we want a key system that is unique from the start. That means we should use the full date, YYYYMMDD as suggested by Daniel. In the event that multiple sources are published by the same set of authors on the same day, we can use a, b, c disambiguation.

This gives us the following key, guaranteed to be unique:

KangHsuKrajbich20091011b

Brian
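The scheme in this message (surnames, full date, then a letter when keys collide) could look roughly like the Python sketch below. Several details are assumptions rather than anything the thread settles: the first source on a given day is left without a letter and later ones get b, c, and so on; slashed surnames are cut to the part before the slash; and `dated_key` and `existing_keys` are hypothetical names.

```python
import datetime
import string


def dated_key(surnames, published, existing_keys, max_authors=3):
    """Build a SurnamesYYYYMMDD key, adding a letter only if the base key is taken."""
    # Assumption: keep the portion of a surname before any slash.
    names = [s.split("/")[0] for s in surnames[:max_authors]]
    base = "".join(names) + published.strftime("%Y%m%d")
    if base not in existing_keys:
        return base
    # Assumption: the unsuffixed key counts as "a", so suffixes start at "b".
    for letter in string.ascii_lowercase[1:]:
        candidate = base + letter
        if candidate not in existing_keys:
            return candidate
    raise ValueError("too many sources share the key base " + base)


keys = set()
for _ in range(2):
    new_key = dated_key(["Kang", "Hsu", "Krajbich"], datetime.date(2009, 10, 11), keys)
    keys.add(new_key)
    print(new_key)
```

Because an existing key is never rewritten when a later source arrives, this variant respects the "do not change keys" constraint, at the cost of the first key on a given day looking different from its siblings.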
Re: [Wiki-research-l] There is no silver identifier
For items that have been assigned a DOI, isn't the DOI unique (in the absence of errors--which I cannot recall having ever encountered)? Of course the same item in its various manifestations may have multiple DOIs, or may have versions that do not have DOIs as well as versions that do have them, and the versions may or may not be identical.

We also need to account for the presence of illegitimate as well as legitimate copies--a person entering a WP reference may have gotten it from a site that has an unauthorized copy--quite a few scientific papers are present on the web in such versions.

There are really two problems: one is a pointer to the voucher authorized version of a document, which may well be the printed version, and the other problem is pointers to accessible legitimate versions. Crossref does a fairly nice job of this for online articles, but it is organized to provide access to paid publishers' versions preferentially, rather than to possible legitimate free versions.

On Wed, Jul 21, 2010 at 6:20 PM, Jodi Schneider jodi.schnei...@deri.org wrote:

> On 21 Jul 2010, at 21:43, Reid Priedhorsky wrote:
>
>> A compromise could be that the ID is the first author's name plus an auto-incremented ID per author. So for example, the first paper of mine the system learns is priedhorsky1, the second priedhorsky2, etc. So you get a system-generated ID for uniqueness but also something comprehensible for people.
>
> Interesting. I'd really like IDs to be not only comprehensible but also to have a fair chance of being directly inputtable by humans. For instance, on Wikipedia, if I know that I am looking for the article on citation signals I can type the URL directly, without searching. In my ideal citation-wiki-in-the-sky, you could get to the citation directly in this way -- and sensible disambiguation pages would be automatically generated.
>
> -Jodi

--
David Goodman, Ph.D, M.L.S.
http://en.wikipedia.org/wiki/User_talk:DGG
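Reid's compromise (first author's name plus a per-author counter) is simple enough to sketch directly. The class and method names below are hypothetical, and a real system would need to persist the counters and decide how names are normalized before counting.

```python
from collections import defaultdict


class AuthorCounterIds:
    """Assign IDs of the form <surname><n>, counting sources per first author."""

    def __init__(self):
        self._counts = defaultdict(int)

    def next_id(self, first_author_surname):
        name = first_author_surname.lower()
        self._counts[name] += 1
        return f"{name}{self._counts[name]}"


ids = AuthorCounterIds()
print(ids.next_id("Priedhorsky"))
print(ids.next_id("Priedhorsky"))
print(ids.next_id("Schneider"))
```

The number depends on the order in which the system learns about papers, so such an ID is readable but not guessable, which is the gap Jodi points to in her reply.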
Re: [Wiki-research-l] [Foundation-l] WikiCite - new WMF project? Was: UPEI's proposal for a universal citation index
The model for this is WP:Book sources, though this relies upon the user selecting the appropriate places to look, rather than guiding him.

On Wed, Jul 21, 2010 at 6:33 PM, Jodi Schneider jodi.schnei...@deri.org wrote:

> On 21 Jul 2010, at 19:47, Brian J Mingus wrote:
>
>> Finn,
>>
>> I'm not a fan of including a portion of the title for a couple of reasons. First, it's not required to make the key unique. Second, it makes the key longer than necessary. Third, the first word or words from a title are not guaranteed to convey any meaning.
>>
>> Regarding a Reference: namespace, I can see how this has some utility and why projects have moved to it. However, I consider it a stopgap solution that projects have implemented when what they really want is a proper wiki for citations. Here are a few quick things that you can't do (or would have to go out of your way to do) with just a Reference namespace that you can do with a wiki dedicated to all the world's citations:
>>
>> - Custom reports that are boolean combinations of citation fields, à la SMW.
>
> This requires substantive new technology as SMW doesn't scale.
>
>> - User bibliographies which are a logical subset of all literature ever published.
>
> Not sure why a Reference namespace couldn't do this.
>
>> - Conduct a search of the literature.
>
> Or this (you can search just one namespace)
>
>> - A new set of policies that are not necessarily NPOV, regarding the creation of articles that discuss collections of literature (lit review-like concept). The content of these policies will emerge over years with the help of a community. These articles could, for instance, help people who are navigating a new area of a literature avoid getting stuck in local minima. It could point out the true global context to them. It could point out experimenter biases in the literature; for example, a recent article was published where it was found that citation networks in academic literature can have a tendency to form based on the assumption of authority, when in fact that authority is false, bringing a whole thread of publications into doubt.
>
> I'm not sure that literature reviews belong in the same wiki as citations. That's definitely a different namespace. :)
>
>> - Create wiki articles about individual sources.
>
> This might or might not be the same wiki -- but that could be interesting. I could imagine a page for a journal being pulled in from several sources: the collection of citations in the wiki for that journal, RSS from the current contents (license permitting), a Wikipedia page about the journal (if it exists), a link to author guidelines/submission info, open access info from SHERPA/ROMEO. In this vision, very little of the content lives in this wiki itself. Rather, it's templated from numerous other places. Perhaps in the way "buy this book" links are handled in librarything -- there are numerous external links which can be activated with a checkbox, and some external content that is pulled in based on copyright review.
>
>> While I am not dedicated to any of these things happening, I also do not wish to rule them out. The hope is that a new community will emerge around the project and guide it in the direction that is most useful. My hope in this thread is that we can identify some of the most likely cases and imagine what it will be like, so that we can convey this vision to the Foundation and they can get a sense of the potential importance of the project.
>
> Scoping is a big problem, I think -- because it would help to have a vision of which of several related tasks/endpoints is primary.
>
> I think an investigation of what fr.wikipedia is doing would be really useful -- does anybody edit there, or have an interest in digging into that? Questions might include: What is the reference namespace doing? What isn't it doing, that they wish it would? Did they consider alternatives to a namespace? How is maintenance going? Do they see the reference namespace as longstanding into the future, or as a stopgap?
>
> -Jodi

--
David Goodman, Ph.D, M.L.S.
http://en.wikipedia.org/wiki/User_talk:DGG
Re: [Wiki-research-l] Generate Wikipedia citations on Open Library
Why would anyone cite this particular edition? It's not the first ed., which is, I think, http://openlibrary.org/books/OL23411638M/inland_voyage. It's not even the first American edition. It's not a standard scholarly edition. It's not an earlier collected edition. It's not an edition which is currently in print.

What's more, it's a defective record, because the date on the displayed cover does not match the date of the edition on the catalog record--which is the date on the title page of the actual copy scanned, which does not have the original cover. The cover was selected by an automatic algorithm, which got it wrong.

If we're going to standardize citations, we should standardize a correct record to an appropriate version, not any version that happens along. Of course, that's considerably harder. But I do not see the point of setting up an elaborate system based on bad data.

On Wed, Jul 21, 2010 at 3:44 PM, Edward Betts edw...@archive.org wrote:

> http://openlibrary.org/books/OL17963918M/An_inland_voyage

--
David Goodman, Ph.D, M.L.S.
http://en.wikipedia.org/wiki/User_talk:DGG
Re: [Wiki-research-l] [Foundation-l] WikiCite - new WMF project? Was: UPEI's proposal for a universal citation index
Sure, but first, is this capable of being done at all? I have never seen a method of bibliographic control that can cope with the complete range of publications, even just print publications. Perhaps we need to proceed within narrow domains.

Second, is this capable of being done by crowd-sourcing, or does it require enforceable standards? The work of Open Library is not a promising model, being an uncontrolled mix, done to many different standards.

Actually, within the domain of scientific journal articles from the last 10 years in Western languages, the best current method seems to be a mechanical algorithm, the one used by Google Scholar. True, it does not aggregate perfectly--but it does aggregate better than any other existing database. And it does not get them all--nor could it no matter how much improved, for many of the versions that are actually available are off limits to its crawlers.

On Wed, Jul 21, 2010 at 7:02 PM, Brian J Mingus brian.min...@colorado.edu wrote: On Wed, Jul 21, 2010 at 4:33 PM, Jodi Schneider jodi.schnei...@deri.org wrote: On 21 Jul 2010, at 19:47, Brian J Mingus wrote: Finn, I'm not a fan of including a portion of the title for a couple of reasons. First, it's not required to make the key unique. Second, it makes the key longer than necessary. Third, the first word or words from a title are not guaranteed to convey any meaning. Regarding a Reference: namespace, I can see how this has some utility and why projects have moved to it. However, I consider it a stopgap solution that projects have implemented when what they really want is a proper wiki for citations. Here are a few quick things that you can't do (or would have to go out of your way to do) with just a Reference namespace that you can do with a wiki dedicated to all the world's citations: - Custom reports that are boolean combinations of citation fields, ala SMW. This requires substantive new technology as SMW doesn't scale.
- User bibliographies which are a logical subset of all literature ever published. Not sure why a Reference namespace couldn't do this. - Conduct a search of the literature. Or this (you can search just one namespace) - A new set of policies that are not necessarily NPOV, regarding the creation of articles that discuss collections of literature (lit review-like concept). The content of these policies will emerge over years with the help of a community. These articles could, for instance, help people who are navigating a new area of a literature avoid getting stuck in local minima. It could point out the true global context to them. It could point out experimenter biases in the literature; for example, a recent article was published where it was found that citation networks in academic literature can have a tendency to form based on the assumption of authority, when in fact that authority is false, bringing a whole thread of publications into doubt. I'm not sure that literature reviews belong in the same wiki as citations. That's definitely a different namespace. :) - Create wiki articles about individual sources. This might or might not be the same wiki -- but that could be interesting. I could imagine a page for a journal being pulled in from several sources: the collection of citations in the wiki for that journal, RSS from the current contents (license permitting), a Wikipedia page about the journal (if it exists), a link to author guidelines/submission info, open access info from SHERPA/ROMEO, In this vision, very little of the content lives in this wiki itself. Rather, it's templated from numerous other places Perhaps in the way buy this book links are handled in librarything -- there are numerous external links which can be activated with a checkbox, and some external content that is pulled in based on copyright review. While I am not dedicated to any of these things happening, I also do not wish to rule them out. 
The hope is that a new community will emerge around the project and guide it in the direction that is most useful. My hope in this thread is that we can identify some of the most likely cases and imagine what it will be like, so that we can convey this vision to the Foundation and they can get a sense of the potential importance of the project. Scoping is a big problem, I think -- because it would help to have a vision of which of several related tasks/endpoints is primary. I think an investigation of what fr.wikipedia is doing would be really useful -- does anybody edit there, or have an interest in digging into that? Questions might include: What is the reference namespace doing? What isn't it doing, that they wish it would? Did they consider alternatives to a namespace? How is maintenance going? Do they see the reference namespace as longstanding into the future, or as a stopgap? -Jodi More broadly speaking, a reference namespace does not accomplish the goal of having a free
Re: [Wiki-research-l] [Foundation-l] WikiCite - new WMF project? Was: UPEI's proposal for a universal citation index
On Wed, Jul 21, 2010 at 5:47 PM, David Goodman dgoodma...@gmail.com wrote:

> Sure, but first, is this capable of being done at all? I have never seen a method of bibliographic control that can cope with the complete range of publications, even just print publications. Perhaps we need to proceed within narrow domains.

I assume that by range you mean the number of publications in a domain, and that by domain you mean the type of publication, be it a book, webpage or map. The generic nature of a markup such as wiki template syntax allows us to easily adapt the same application to new domains. The challenge of the range within a domain is largely one of resolving ambiguities, which can be settled with policies that carefully adjudicate troublesome cases.

> Second, is this capable of being done by crowd-sourcing, or does it require enforceable standards? The work of Open Library is not a promising model, being an uncontrolled mix, done to many different standards.
>
> Actually, within the domain of scientific journal articles from the last 10 years in Western languages, the best current method seems to be a mechanical algorithm, the one used by Google Scholar. True, it does not aggregate perfectly--but it does aggregate better than any other existing database. And it does not get them all--nor could it no matter how much improved, for many of the versions that are actually available are off limits to its crawlers.

In my conception the enforceable standards are to emerge in the meta pages of this project based on the actual issues that the community encounters.

Googlebot has many deep web accounts to journals online. When you search Google Scholar the relevance algorithm is actually comparing your query to the content of PDF pages which you do not have permission to access. Of course, Google can't access them all, but many publishers have found it in their interest to give them a complimentary account since it drives subscription rates.

We can rely on individuals, particularly academics, who have access to the deep web to help us curate the bibliography. And we can rely on the massive number of personal bibliographies already out there to help us get good coverage.

Cleaning up the mass of bibliographic content that I anticipate would be uploaded by users would require the writing of bots in coordination with the creation of policy pages. Getting rid of copyrighted material would be handled in the same manner, I presume.

After major content publishers see what we are doing, I am sure they will let us know their opinion about what we can and cannot do. It seems likely that they will overreach their bounds, and as I have seen on Wikipedia, the community members will happily ignore them. Or, if they think the requests are actually in compliance with the law, they will comply.

Brian