Re: Ranking Web of Repositories: July 2010 Edition

2010-07-19 Thread Rob Ingram
 -Original Message-
 From: American Scientist Open Access Forum [mailto:AMERICAN-SCIENTIST-
 open-access-fo...@listserver.sigmaxi.org] On Behalf Of Leslie Carr
 Sent: 12 July 2010 13:17
 To: american-scientist-open-access-fo...@listserver.sigmaxi.org
 Subject: Re: Ranking Web of Repositories: July 2010 Edition
 
 On 12 Jul 2010, at 06:25, Leslie Chan wrote:
 
  Why wait for Microsoft? What has the open source community been doing on
  this front? What about OpenOffice? Any good open source NLM DTD conversion
  tools out there? Why has it taken so long?
 
 If there were something for OpenOffice, it would be trivial for
 repositories to apply it to Microsoft Word documents.
 --
 Les Carr

Coincidentally, a colleague has just alerted me to PKP's Lemon8-XML
project, which does just this conversion, though I'm not sure it has an
API.

http://pkp.sfu.ca/lemon8
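
For repositories that want to automate this step, the integration itself can be
thin: an ingest hook that hands a Word file to whatever converter is available
(Lemon8-XML or anything else) and stores the resulting NLM XML alongside the
original deposit. The sketch below is illustrative only; "doc2nlm" is an assumed
command-line converter, not a documented Lemon8-XML interface.

# Sketch: wrap a hypothetical Word-to-NLM-XML converter in a repository ingest hook.
# "doc2nlm" is an assumed command, not part of Lemon8-XML's documented interface.
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path

def convert_to_nlm(word_file: Path, out_dir: Path) -> Path:
    """Convert a .doc/.docx deposit to NLM XML and return the path to the XML file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    xml_file = out_dir / (word_file.stem + ".xml")
    # Hypothetical converter invocation; replace with the real tool's CLI or API.
    subprocess.run(["doc2nlm", str(word_file), "-o", str(xml_file)], check=True)
    ET.parse(xml_file)  # minimal sanity check: the output should at least parse as XML
    return xml_file

if __name__ == "__main__":
    print(convert_to_nlm(Path("deposit/article.docx"), Path("deposit/nlm")))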

Rob.


-- 
Rob Ingram

Technical Developer (RSP)

Centre for Research Communications
University of Nottingham
Greenfield Medical Library
A.31 Queen's Medical Centre
Nottingham NG7 2UH

T: +44 (0) 115 84 68602
F: +44 (0) 115 82 30549
rob.ing...@nottingham.ac.uk

http://rsp.ac.uk
http://crc.nottingham.ac.uk

This message has been checked for viruses but the contents of an attachment
may still contain software viruses which could damage your computer system:
you are advised to perform your own checks. Email communications with the
University of Nottingham may be monitored as permitted by UK legislation.


Re: Ranking Web of Repositories: July 2010 Edition

2010-07-12 Thread Leslie Chan
On 7/11/10 6:49 AM, Leslie Carr l...@ecs.soton.ac.uk wrote:

 On 10 Jul 2010, at 15:37, Peter Suber wrote:
 For more detail on rich media or rich files, see the Webometrics
page on
 methodology:  Only the number of text files in Acrobat format (.pdf)
 ... are considered.  This is a bug, not a feature.  A more useful ranking
would try
 to count full-text scholarly or peer-reviewed articles regardless of
format.
 I know that's hard to do.  But it's a mistake to use any format as a
 surrogate for that status, and especially a format as flawed as PDF.
Even if
 Webometrics wanted to reward some formats more than others, it should not
 reward PDF.
 I think it should. The overwhelming majority of academic papers are
 distributed online as PDF; the overwhelming majority of things in
repositories
 that are not PDF are not academic papers.

This is rather circular. The view that academic papers should be fixed in
form and format is rather out of sync with the emergence of new forms of
scholarly expression enabled by the web. Here is an interesting commentary
in a recent THE:

 Academics in the humanities and social sciences need to question whether
the current narrowly conceived conventions of academic publication are in
our best interests. If reality is multifaceted, then writing that responds
to it needs to be multifaceted, too. Academics should be encouraged to
explore a heterogeneous range of formats, reaching different audiences and
finding new ways to write about research.
http://www.timeshighereducation.co.uk/story.asp?storyCode=411466&sectioncode=26

I think this discussion raises a fundamental question about the design of
IRs and their support for scholarship. IRs must do better to capture the
diversity of scholarly contribution and formats, and make them count in
a meaningful way.

 The format is optimized for print or reading, not for use or reuse.
PDFs are
 slow to load and often not even readable in bandwidth-poor parts of the
 world.  They crash many browsers.  They often lack working links; when
they
 do have links, they require users to open in the same window rather
than in a
 separate window, losing the file that took so long to load.  Users can't
 deep-link to subsections.  Publishers can lock them to prevent cutting and
 pasting.  Publishers can insert scripts to make them unreadable offline or
 after a certain time.  PDFs impede text processing by users, text
mining by
 software, handicapped access (read-aloud software), and mark-up by third
 parties.
 This is an argument about what software/data formats researchers
*should* use;
 affecting their authoring and editorial processes is probably beyond the
scope
 of what we can expect from this league table.

This points to the problem with league tables in general. Much like the
journal-ranking league tables in the Journal Citation Reports, such
tables gloss over what is important to different disciplinary needs and
authoring processes, and privilege quantitative measures that encourage
spurious ranking and comparison. Do we really need more output-based
comparisons?

 PubMed Central scores low in the Webometric rankings because it has no
PDFs.
 It does have PDFs - it might ingest articles in XML, but it certainly
 exports them in PDF. Enquiring of Google (site:www.ncbi.nlm.nih.gov
 filetype:pdf) shows that it has about 6,690,000 PDFs.

So PMC is being penalized by the ranking system because it is dynamic?

 But PMC is one of the most populated and useful OA repositories in the
world.
 This is something that needs investigating. If I had to guess why it
ranks so
 low, it might be because no-one is linking INTO pubmed; rather they are
 linking to the original publishers.

How should we define the most useful? Should download and other usage
stats be taken into consideration, instead of only in-bound links?

 The format it uses instead of PDF, the NLM DTD coded in XML, is vastly
 superior to PDF for every scholarly purpose. I haven't had time to code my
 articles in XML.  But since even HTML is superior to PDF for purposes of
 access and reuse, I self-archive in HTML rather than PDF whenever I can.
 For the record, I completely agree with you about PDF / HTML / XHTML. If
only
 Microsoft Word (and LaTeX) had decent export facilities that produced good
 semantic HTML.

Why wait for Microsoft? What has the open source community been doing on
this front? What about OpenOffice? Any good open source NLM DTD conversion
tools out there? Why has it taken so long?

Leslie (Chan)

 --
 Les Carr




Re: Ranking Web of Repositories: July 2010 Edition

2010-07-12 Thread Isidro F . Aguillo
Dear all:

In fact we have already taken into account some of your comments in the latest
editions of the ranking. Let me explain:

- The ranking is based on a 1:1 ratio between ACTIVITY and VISIBILITY, so
publishing a lot of OA papers is as important as doing it in a way that others
(worldwide) can retrieve, use and link to them. The 1:1 ratio means the weight of
each is 50%. As stated in previous messages, visibility is measured by counting the
total number of external inlinks.

- Regarding activity, we decided to follow your advice, so the value is
calculated by giving roughly equal weight to these three variables (see the
sketch after this list):

* Number of papers, usually full-text articles, using the number of items from
Google Scholar as a proxy
* Number of web pages: ALL the web pages of the website (usually HTML or
similar, but also other formats)
* Number of documents: a subset of the former, namely files in rich formats
such as PDF, PS, DOC or PPT. It is probably true that PDF is not the best
format and perhaps we should consider other formats, but people are not using
other formats. The number of files in OpenOffice formats, XML or others is
negligible, and useless for ranking purposes.

- PMC. Our policy is not to rank repositories that lack their own domain or
subdomain. There are technical reasons for this, but also visibility ones. The
address of PMC is absurdly complex:

www.ncbi.nlm.nih.gov/pmc

Regarding UK PMC, it is included in the ranking, but its position suffers
because it does not use suffixes in its file names. It has hundreds of
thousands of Adobe Acrobat (PDF) files that are not named as *.pdf, which
prevents efficient filtering by file type in the major search engines.
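
To make the weighting concrete, here is a minimal sketch of how such a composite
score could be assembled from the counts described above. The normalisation step
and the variable names are assumptions for illustration; Webometrics does not
publish its exact computation.

# Minimal sketch of a Webometrics-style composite score, assuming:
#   visibility = external inlinks (50% of the total)
#   activity   = Google Scholar items + web pages + rich-format files,
#                in roughly equal parts (the other 50%)
# The log-normalisation against the best-performing repository is an assumption
# for illustration, not the published method.
import math

def normalise(value, best):
    """Scale a raw count to 0..1 relative to the highest count observed."""
    if best <= 0:
        return 0.0
    return math.log(1 + value) / math.log(1 + best)

def composite_score(repo, best):
    visibility = normalise(repo["inlinks"], best["inlinks"])
    activity = (normalise(repo["scholar_items"], best["scholar_items"])
                + normalise(repo["web_pages"], best["web_pages"])
                + normalise(repo["rich_files"], best["rich_files"])) / 3.0
    return 0.5 * visibility + 0.5 * activity

# Example with made-up numbers:
best = {"inlinks": 500000, "scholar_items": 200000, "web_pages": 800000, "rich_files": 300000}
repo = {"inlinks": 120000, "scholar_items": 45000, "web_pages": 150000, "rich_files": 60000}
print(round(composite_score(repo, best), 3))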

Best regards,




On 11/07/2010 15:21, Peter Suber wrote:
  Hi Les:  You're arguing that Webometrics should count PDFs, and I
  fully agree.  I was only arguing that Webometrics should not *limit*
  its count to PDFs.  Sorry if I didn't make that clear.
BTW, I'd make the analogous case to publishers.  Publish in PDF if you
like, but never publish in PDF-only.  If you offer PDF editions, then also
offer XML or HTML editions.

     Best,      Peter

Peter Suber
www.bit.ly/suber

--

On Sun, Jul 11, 2010 at 6:49 AM, Leslie Carr l...@ecs.soton.ac.uk wrote:
  On 10 Jul 2010, at 15:37, Peter Suber wrote:
  For more detail on rich media or rich files, see the
  Webometrics page on methodology:  Only the number of
  text files in Acrobat format (.pdf) ... are
  considered.  This is a bug, not a feature.  A more
  useful ranking would try to count full-text scholarly or
  peer-reviewed articles regardless of format.  I know
  that's hard to do.  But it's a mistake to use any format
  as a surrogate for that status, and especially a format
  as flawed as PDF. Even if Webometrics wanted to reward
  some formats more than others, it should not reward PDF.

I think it should. The overwhelming majority of academic papers are
distributed online as PDF; the overwhelming majority of things in
repositories that are not PDF are not academic papers.

  The format is optimized for print or reading, not for
  use or reuse.  PDFs are slow to load and often not even
  readable in bandwidth-poor parts of the world.  They
  crash many browsers.  They often lack working links;
  when they do have links, they require users to open in
  the same window rather than in a separate window, losing
  the file that took so long to load.  Users can't
  deep-link to subsections.  Publishers can lock them to
  prevent cutting and pasting.  Publishers can insert
  scripts to make them unreadable offline or after a
  certain time.  PDFs impede text processing by users,
  text mining by software, handicapped access
  (read-aloud software), and mark-up by third parties.

This is an argument about what software/data formats researchers
*should* use; affecting their authoring and editorial processes is
probably beyond the scope of what we can expect from this league
table.

  PubMed Central scores low in the Webometric rankings
  because it has no PDFs.

It does have PDFs - it might ingest articles in XML, but it
certainly exports them in PDF. Enquiring of Google
(site:www.ncbi.nlm.nih.gov filetype:pdf) shows that it has about
6,690,000 PDFs.

  But PMC is one of the most populated and useful OA
  repositories in the world.

This is something that needs investigating. If I had to guess why it
ranks so low, it might be because no-one is linking INTO pubmed;
rather they are linking to the original publishers. 

  The format it uses instead of PDF, the NLM DTD coded in
  XML, is vastly superior to PDF for every scholarly
  purpose. I haven't had time to code my articles in XML.
   But since even HTML is superior to PDF for purposes of
  access and reuse, I self-archive in HTML rather than PDF
  whenever I can.

For the record, I completely agree with you about PDF / HTML / XHTML. If
only Microsoft Word (and LaTeX) had decent export facilities that produced
good semantic HTML.

Re: Ranking Web of Repositories: July 2010 Edition

2010-07-12 Thread Isidro F . Aguillo
  On 12/07/2010 7:25, Leslie Chan wrote:
 On 7/11/10 6:49 AM, Leslie Carr l...@ecs.soton.ac.uk wrote:

 On 10 Jul 2010, at 15:37, Peter Suber wrote:
 For more detail on rich media or rich files, see the Webometrics
 page on
 methodology:  Only the number of text files in Acrobat format (.pdf)
 ... are considered.  This is a bug, not a feature.  A more useful ranking
 would try
 to count full-text scholarly or peer-reviewed articles regardless of
 format.
 I know that's hard to do.  But it's a mistake to use any format as a
 surrogate for that status, and especially a format as flawed as PDF.
 Even if
 Webometrics wanted to reward some formats more than others, it should not
 reward PDF.
 I think it should. The overwhelming majority of academic papers are
 distributed online as PDF; the overwhelming majority of things in
 repositories
 that are not PDF are not academic papers.
 This is rather circular. The view that academic papers should be fixed in
 form and format is rather out of sync with the emergence of new forms of
 scholarly expression enabled by the web. Here is an interesting commentary
 in a recent THE:

  Academics in the humanities and social sciences need to question whether
 the current narrowly conceived conventions of academic publication are in
 our best interests. If reality is multifaceted, then writing that responds
 to it needs to be multifaceted, too. Academics should be encouraged to
 explore a heterogeneous range of formats, reaching different audiences and
 finding new ways to write about research.
 http://www.timeshighereducation.co.uk/story.asp?storyCode=411466&sectioncode=26

 I think this discussion raises a fundamental question about the design of
 IRs and their support for scholarship. IRs must do better to capture the
 diversity of scholarly contribution and formats, and make them count in
 a meaningful way.
Dear Leslie:

You are completely right: other formats should be used, far better than 
those currently available, and of course open source/open access. Now 
try to convince 1 billion Internet users to do that. NOBODY (well, a few 
thousand people) is using these other formats today (yet). I have the figures.

 The format is optimized for print or reading, not for use or reuse.
 PDFs are
 slow to load and often not even readable in bandwidth-poor parts of the
 world.  They crash many browsers.  They often lack working links; when
 they
 do have links, they require users to open in the same window rather
 than in a
 separate window, losing the file that took so long to load.  Users can't
 deep-link to subsections.  Publishers can lock them to prevent cutting and
 pasting.  Publishers can insert scripts to make them unreadable offline or
 after a certain time.  PDFs impede text processing by users, text
 mining by
 software, handicapped access (read-aloud software), and mark-up by third
 parties.
 This is an argument about what software/data formats researchers
 *should* use;
 affecting their authoring and editorial processes is probably beyond the
 scope
 of what we can expect from this league table.
 This points to the problem with league tables in general. Much like the
 journal-ranking league tables in the Journal Citation Reports, such
 tables gloss over what is important to different disciplinary needs and
 authoring processes, and privilege quantitative measures that encourage
 spurious ranking and comparison. Do we really need more output-based
 comparisons?
I have only a few numbers to support the need for league tables. QS, 
the former editors of the THES Ranking of Universities, stated that they 
received 18 million visitors per year; our Web Ranking is close to 5 
million, and the Shanghai ranking probably reaches similar or even higher 
levels.

 PubMed Central scores low in the Webometric rankings because it has no
 PDFs.
 It does have PDFs - it might ingest articles in XML, but it certainly
 exports them in PDF. Enquiring of Google (site:www.ncbi.nlm.nih.gov
 filetype:pdf) shows that it has about 6,690,000 PDFs.
 So PMC is being penalized by the ranking system because it is dynamic?

Nobody is saying that. PMC is excluded because it does not have its own 
domain or subdomain. You can disagree, but I dislike my papers being 
URL-authored by the library.

 But PMC is one of the most populated and useful OA repositories in the
 world.
 This is something that needs investigating. If I had to guess why it
 ranks so
 low, it might be because no-one is linking INTO pubmed; rather they are
 linking to the original publishers.
 How should we define the most useful? Should download and other usage
 stats be taken into consideration, instead of only in-bound links?
As soon as (standardized) user statistics become available they will be 
used. Good indicators need to be useful but also feasible.


 The format it uses instead of PDF, the NLM DTD coded in XML, is vastly
 superior to PDF for every scholarly purpose. I haven't had time to code my
 articles in XML.  But since even HTML is superior to PDF for purposes of
 access and reuse, I self-archive in HTML rather than PDF whenever I can.

Re: Ranking Web of Repositories: July 2010 Edition

2010-07-12 Thread Leslie Carr
On 12 Jul 2010, at 06:25, Leslie Chan wrote:

 This is rather circular. The view that academic papers should be fixed in
 form and format is rather out of sync with the emergence of new forms of
 scholarly expression enabled by the web.
I don't wish to argue that academic writing SHOULD BE fixed in format, merely 
to observe that IT IS predominantly so.

  Academics should be encouraged to
 explore a heterogeneous range of formats, reaching different audiences and
 finding new ways to write about research.
When they do, we'll find a way to measure it :-)
If you believe they are in a significant way, let's do it!

 I think this discussion raises a fundamental question about the design of
 IRs and their support for scholarship. IRs must do better to capture the
 diversity of scholarly contribution and formats, and make them count in
 a meaningful way.
I wholeheartedly concur.

 Do we really need more output-based comparisons?
We need a range of comparisons of many sorts to get as full a picture as 
possible.

 How should we define the most useful? Should download and other usage
 stats be taken into consideration, instead of only in-bound links?
If we had access to those statistics, by all means let's use them.

 Why wait for Microsoft? What has the open source community been doing on
 this front? What about OpenOffice? Any good open source NLM DTD conversion
 tools out there? Why has it taken so long?

If there were something for OpenOffice, it would be trivial for 
repositories to apply it to Microsoft Word documents.
--
Les Carr


Re: Ranking Web of Repositories: July 2010 Edition

2010-07-12 Thread Frederic MERCEUR

Hello,

Personally I feel uncomfortable with this ranking because, to my mind, it is
incomplete and imprecise.

It is incomplete because repositories hosted in a subdirectory are not
ranked (e.g. www.xxx.zz/repository), for technical reasons, even if, as Isidro
noted, the number of these repositories is far lower than the number of
non-repositories listed in ROAR and OpenDOAR.

It is imprecise because it is based on automatic web queries that are very
sensitive to noise. This is the case, for example, for the visibility indicator
(external inlinks). As far as I understand from Isidro's explanations, a part of
this indicator is calculated with the Yahoo linkdomain function:

linkdomain:http://my_site -site:my_site

I tested this function on a few of the ranked repositories, including ours. More
than 90% (and, in some cases, I guess more than 99%) of the inlinks are not
significant because they come from:

- automatic spam web sites (e.g. www.find-pdf.com, www.mypdffiles.com, ... or
automatic sites such as http://www.123people.fr)
- automatic links from OAI harvesters
- automatic links that come from other domains of the university (e.g.
self-citation through automatically generated personal author pages)
- automatic repetition of the same link: in some forums, a link on the main
page will be duplicated automatically on all archive pages, so one significant
manual link can produce several hundred insignificant automatic links
- ...
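
A rough way to see how much of this noise survives is to classify the inlink
sources before counting them. The sketch below is a hypothetical filter: the
domain lists and classification rules are illustrative assumptions, not part of
the Webometrics methodology or of any Yahoo API.

# Hypothetical post-processing of a list of inlink source URLs, illustrating the
# kinds of noise described above. All domain lists are illustrative assumptions.
from urllib.parse import urlparse

SPAM_DOMAINS = {"find-pdf.com", "mypdffiles.com", "123people.fr"}  # examples from above
OAI_HARVESTERS = {"harvester.example.org"}                         # assumed
OWN_UNIVERSITY_SUFFIX = ".univ-example.fr"                         # assumed

def classify(source_url):
    host = urlparse(source_url).netloc.lower()
    if any(host == d or host.endswith("." + d) for d in SPAM_DOMAINS):
        return "spam"
    if host in OAI_HARVESTERS:
        return "harvester"
    if host.endswith(OWN_UNIVERSITY_SUFFIX):
        return "same-university"
    return "possibly significant"

def significant_inlinks(source_urls):
    # Keep only links that survive the noise filters, and count each linking
    # host once to damp the forum-archive duplication described above.
    kept_hosts = {urlparse(u).netloc.lower()
                  for u in source_urls if classify(u) == "possibly significant"}
    return len(kept_hosts)

print(significant_inlinks([
    "http://www.find-pdf.com/x",
    "http://blog.example.org/post",
    "http://blog.example.org/archive/2010/07",
    "http://dept.univ-example.fr/staff",
]))  # -> 1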

The other indicators (size, rich files, Scholar) may also be unreliable for
similar reasons.

According to Isidro, all these points affect the numbers but not (much) the
ranking. This should be confirmed...

Kind regards,
Fred

 

 



Isidro F. Aguillo wrote:
  Dear all:

  In fact we have already taken into account some of your comments in
  the latest editions of the ranking. Let me explain:

  - The ranking is based on a 1:1 ratio between ACTIVITY and
  VISIBILITY, so publishing a lot of OA papers is as important as doing it
  in a way that others (worldwide) can retrieve, use and link to them.
  The 1:1 ratio means the weight of each is 50%. As stated in previous
  messages, visibility is measured by counting the total number of
  external inlinks.

  - Regarding activity, we decided to follow your advice, so the value
  is calculated by giving roughly equal weight to these three
  variables:

  * Number of papers, usually full-text articles, using the number of
  items from Google Scholar as a proxy
  * Number of web pages: ALL the web pages of the website (usually HTML
  or similar, but also other formats)
  * Number of documents: a subset of the former, namely files in rich
  formats such as PDF, PS, DOC or PPT. It is probably true that PDF is not
  the best format and perhaps we should consider other formats, but
  people are not using other formats. The number of files in
  OpenOffice formats, XML or others is negligible, and useless for
  ranking purposes.

  - PMC. Our policy is not to rank repositories that lack their own domain
  or subdomain. There are technical reasons for this, but also visibility ones.
  The address of PMC is absurdly complex:

  www.ncbi.nlm.nih.gov/pmc

  Regarding UK PMC, it is included in the ranking, but its position
  suffers because it does not use suffixes in its file names.
  It has hundreds of thousands of Adobe Acrobat (PDF) files that are not
  named as *.pdf, which prevents efficient filtering by file type
  in the major search engines.

  Best regards,


--
Fred Merceur
Ifremer / Bibliothèque La Pérouse
frederic.merc...@ifremer.fr
Tél : 02-98-49-88-69
Fax : 02-98-49-88-84
Archimer, Ifremer's Institutional Repository
Avano, a marine and aquatic OAI harvester
Bibliothèque La Pérouse

Before printing, think of the environment!




Re: Ranking Web of Repositories: July 2010 Edition

2010-07-11 Thread Peter Suber
Hi Les:  You're arguing that Webometrics should count PDFs, and I fully agree.
 I was only arguing that Webometrics should not *limit* its count to PDFs.
 Sorry if I didn't make that clear.
BTW, I'd make the analogous case to publishers.  Publish in PDF if you like, 
but
never publish in PDF-only.  If you offer PDF editions, then also offer XML or
HTML editions.

     Best,      Peter

Peter Suber
www.bit.ly/suber

--

On Sun, Jul 11, 2010 at 6:49 AM, Leslie Carr l...@ecs.soton.ac.uk wrote:
  On 10 Jul 2010, at 15:37, Peter Suber wrote:
  For more detail on rich media or rich files, see the
  Webometrics page on methodology:  Only the number of text
  files in Acrobat format (.pdf) ... are considered.  This is
  a bug, not a feature.  A more useful ranking would try to
  count full-text scholarly or peer-reviewed articles regardless
  of format.  I know that's hard to do.  But it's a mistake to
  use any format as a surrogate for that status, and especially
  a format as flawed as PDF. Even if Webometrics wanted to
  reward some formats more than others, it should not reward
  PDF.

I think it should. The overwhelming majority of academic papers are
distributed online as PDF; the overwhelming majority of things in
repositories that are not PDF are not academic papers.

  The format is optimized for print or reading, not for use or
  reuse.  PDFs are slow to load and often not even readable in
  bandwidth-poor parts of the world.  They crash many browsers.
   They often lack working links; when they do have links, they
  require users to open in the same window rather than in a
  separate window, losing the file that took so long to load.
   Users can't deep-link to subsections.  Publishers can lock
  them to prevent cutting and pasting.  Publishers can insert
  scripts to make them unreadable offline or after a certain
  time.  PDFs impede text processing by users, text mining by
  software, handicapped access (read-aloud software), and
  mark-up by third parties.

This is an argument about what software/data formats researchers *should*
use; affecting their authoring and editorial processes is probably beyond
the scope of what we can expect from this league table.

  PubMed Central scores low in the Webometric rankings because
  it has no PDFs.

It does have PDFs - it might ingest articles in XML, but it certainly
exports them in PDF. Enquiring of Google (site:www.ncbi.nlm.nih.gov
filetype:pdf) shows that it has about 6,690,000 PDFs.

  But PMC is one of the most populated and useful OA
  repositories in the world.

This is something that needs investigating. If I had to guess why it ranks
so low, it might be because no-one is linking INTO pubmed; rather they are
linking to the original publishers. 

  The format it uses instead of PDF, the NLM DTD coded in XML,
  is vastly superior to PDF for every scholarly purpose. I
  haven't had time to code my articles in XML.  But since even
  HTML is superior to PDF for purposes of access and reuse, I
  self-archive in HTML rather than PDF whenever I can.

For the record, I completely agree with you about PDF / HTML / XHTML. If
only Microsoft Word (and LaTeX) had decent export facilities that produced
good semantic HTML.

--
Les Carr





Re: Ranking Web of Repositories: July 2010 Edition

2010-07-10 Thread Peter Suber
On Thu, Jul 8, 2010 at 8:59 AM, Leslie Carr l...@ecs.soton.ac.uk wrote:
  [...]
  If you assume that a repository is full of locally-authored research
  literature then you will find all sorts of counter-examples in one
  area or another. The Rich Media criterion goes some way to
  filtering out non-documents, but whether the items are scholarly
  or local or equivalent to those in other repositories is very
  difficult to ascertain.


For more detail on rich media or rich files, see the Webometrics page on
methodology:  Only the number of text files in Acrobat format (.pdf) ... are
considered.
http://repositories.webometrics.info/methodology_rep.html

This is a bug, not a feature.  A more useful ranking would try to count
full-text scholarly or peer-reviewed articles regardless of format.  I know
that's hard to do.  But it's a mistake to use any format as a surrogate for 
that
status, and especially a format as flawed as PDF.

Even if Webometrics wanted to reward some formats more than others, it should
not reward PDF.  The format is optimized for print or reading, not for use or
reuse.  PDFs are slow to load and often not even readable in bandwidth-poor
parts of the world.  They crash many browsers.  They often lack working links;
when they do have links, they require users to open in the same window rather
than in a separate window, losing the file that took so long to load.  Users
can't deep-link to subsections.  Publishers can lock them to prevent cutting 
and
pasting.  Publishers can insert scripts to make them unreadable offline or 
after
a certain time.  PDFs impede text processing by users, text mining by software,
handicapped access (read-aloud software), and mark-up by third parties.  

PubMed Central scores low in the Webometric rankings because it has no PDFs.
 But PMC is one of the most populated and useful OA repositories in the world.
 The format it uses instead of PDF, the NLM DTD coded in XML, is vastly 
superior
to PDF for every scholarly purpose.

I haven't had time to code my articles in XML.  But since even HTML is superior
to PDF for purposes of access and reuse, I self-archive in HTML rather than PDF
whenever I can.

     Peter

Peter Suber
www.bit.ly/suber





Re: Ranking Web of Repositories: July 2010 Edition

2010-07-09 Thread Isidro F . Aguillo
  Dear Stevan:

A lot of interesting stuff to think about. We are already working on 
some of those proposals but it is not easy. However, perhaps you will 
like this page we prepared for the University rankings related to UK 
universities' commitment to OA:

http://www.webometrics.info/openac.html

Thanks for your useful comments,



On 08/07/2010 18:34, Stevan Harnad wrote:
 On 2010-07-08, at 4:43 AM, Isidro F. Aguillo wrote:

 Dear Hélène:

 Thank you for your message, but I disagree with your proposal. We are not 
 measuring only contents but contents AND visibility in the web.
 Dear Isidro,

 If I may intervene with some comments too, as this discussion has some wider 
 implications:

 Yes, you are measuring both contents and visibility, but presumably you want 
 the difference between (1) the ranking of the top 800 repositories and (2) 
 the ranking of the top 800 *institutional* repositories to be based on the 
 fact that the latter are institutional repositories whereas the former are 
 all repositories (central, i.e., multi-institutional, as well as 
 institutional).

 Moreover, if you list redundant repositories (some being the proper subsets 
 of others) in the very same ranking, it seems to me the meaning of the 
 ranking becomes rather vague.

 Certainly HyperHAL covers the contents of all its participants, but the 
 impact of these contents depends of other factors. Probably researchers 
 prefer to link to the paper in INRIA because of the prestige of this 
 institution, the affiliation of the author or the marketing of their 
 institutional repository.
 All true, but perhaps the significance and usefulness of the rankings would 
 be greater if you either changed the weight of the factors (volume of 
 full-text content, number of links) or, alternatively, you designed the 
 rankings so the user could select and weight the criteria on which the 
 rankings are displayed.

 Otherwise your weightings become like the h-index -- an a-priori 
 combination of untested, unvalidated weights that many users may not be 
 satisfied with, or fully informed by...

 But here is a more important aspect. If I were the president of INRIA I would 
 prefer people to use my institutional repository instead of CCSD. No problem 
 with the last one, they are doing a great job and increasing the reach of 
 INRIA, but the papers deposited are a very important (the most important?) 
 asset of INRIA.
 But how much INRIA papers are linked, downloaded and cited is not necessarily 
 (or even probably) a function of their direct locus!

 What is important for INRIA (and all institutions) is that as much as 
 possible of their paper output should be OA, simpliciter, so that it can be 
 linked, downloaded, read, applied, used and cited. It is entirely secondary, 
 for INRIA (and all institutions), *where* their papers are OA, compared to 
 the necessary condition *that* they are OA (and hence freely accessible, 
 usable, harvestable).

 Hence (in my view) by far the most important ranking factor for institutional 
 repositories is how much of their full-text institutional paper output is 
 indeed deposited and OA. INRIA would have no reason to be disappointed if the 
 locus from which its content is searched, retrieved and linked is some other, 
 multi-institutional harvester. INRIA still gets the credit and benefits from 
 all the links, downloads and citations of INRIA content!

 (Having said that, locus of deposit *does* matter, very much, for deposit 
 mandates. Deposit mandates are necessary in order to generate OA content. 
 And, for strategic reasons that are elaborated in my reply to Chris 
 Armbruster, it makes a big practical difference for success in agreeing on 
 the adoption of a mandate that both institutional and funder mandates should 
 require convergent *institutional* deposit, rather than divergent and 
 competing institutional vs. institution-external deposit. Here too, your 
 repository rankings would be much more helpful and informative if they gave a 
 greater weight to the relative size of each institutional repository's 
 content and eliminated multi-institutional repositories from the 
 institutional repository rankings -- or at least allowed institutional 
 repositories to be ranked independently on content vs links.

 I think you are perhaps being misled here by the analogy with your sister 
 rankings of universities (RWWU, http://www.webometrics.info/) rather than their 
 repositories. In university rankings, the links to the university site itself 
 matter a lot. But in repository rankings links matter much less than *how 
 much institutional content is accessible*. For the degree of usage of that 
 content, harvester sites may be more relevant measures, and, after all, 
 downloads and citations, unlike links, carry their credits (to the authors 
 and institutions) with them no matter where the transaction happens to 
 occur...

 Regarding the other comments we are going to correct those with mistakes but 
 it is very difficult for us to realize that Virginia Tech University is faking 
 its institutional repository with contents authored by external scholars.

Re: Ranking Web of Repositories: July 2010 Edition

2010-07-09 Thread Leslie Carr
On 9 Jul 2010, at 08:12, Isidro F. Aguillo wrote:

 However, perhaps you will like this page we prepared for the University 
 rankings related to UK universities' commitment to OA:
 http://www.webometrics.info/openac.html

Thanks for preparing the page - it is very informative and helpful in answering 
questions about the interpretation of the IR ranking relating to the 
discrepancy between the relative ordering of institutions in the IR list and 
other (independent) research rankings.

As you point out, much of the difference is explained by the relative 
openness of each institution's literature. Since 50% of the score is devoted 
to in-links, and there is little motivation to link to an empty bibliographic 
record, a high proportion of OA papers will tend to attract more links and more 
traffic, and hence make for a more impactful repository.

Some institutions have therefore benefited from their efforts to deposit OA 
papers, becoming more visible and hence more highly rated. Others are seeing 
the opposite effect -  institutions that would normally be at the top of any 
research list are much lower down than expected. Some of these institutions 
don't have very effective repositories and some do but hide them behind 
firewalls. Either way the net effect is the same - not much visible public 
literature to attract links or traffic.

I hope that the effect of this league table will be to encourage institutions 
to redouble their efforts in regard to Open Access. I also hope that it will be 
possible to have further public dialogue so that the process can be 
increasingly open and the community can better understand, verify and trust 
your metrics. 

Thanks again for your contribution!
--
Les Carr


On 9 Jul 2010, at 08:12, Isidro F. Aguillo wrote:

 Dear Stevan:
 
 A lot of interesting stuff to think about. We are already working on some of 
 those proposals but it is not easy. However, perhaps you will like this page 
 we prepared for the University rankings related to UK universities' commitment 
 to OA:
 
 http://www.webometrics.info/openac.html
 
 Thanks for your useful comments,
 
 
 
 On 08/07/2010 18:34, Stevan Harnad wrote:
 On 2010-07-08, at 4:43 AM, Isidro F. Aguillo wrote:
 
 Dear Hélène:
 
 Thank you for your message, but I disagree with your proposal. We are not 
 measuring only contents but contents AND visibility in the web.
 Dear Isidro,
 
 If I may intervene with some comments too, as this discussion has some wider 
 implications:
 
 Yes, you are measuring both contents and visibility, but presumably you want 
 the difference between (1) the ranking of the top 800 repositories and (2) 
 the ranking of the top 800 *institutional* repositories to be based on the 
 fact that the latter are institutional repositories whereas the former are 
 all repositories (central, i.e., multi-institutional, as well as 
 institutional).
 
 Moreover, if you list redundant repositories (some being the proper subsets 
 of others) in the very same ranking, it seems to me the meaning of the 
 ranking becomes rather vague.
 
 Certainly HyperHAL covers the contents of all its participants, but the 
 impact of these contents depends on other factors. Probably researchers 
 prefer to link to the paper in INRIA because of the prestige of this 
 institution, the affiliation of the author or the marketing of their 
 institutional repository.
 All true, but perhaps the significance and usefulness of the rankings would 
 be greater if you either changed the weight of the factors (volume of 
 full-text content, number of links) or, alternatively, you designed the 
 rankings so the user could select and weight the criteria on which the 
 rankings are displayed.
 
 Otherwise your weightings become like the h-index -- an a-priori 
 combination of untested, unvalidated weights that many users may not be 
 satisfied with, or fully informed by...
 
 But here is a more important aspect. If I were the president of INRIA I 
 would prefer people to use my institutional repository instead of CCSD. No 
 problem with the last one, they are doing a great job and increasing the 
 reach of INRIA, but the papers deposited are a very important (the most 
 important?) asset of INRIA.
 But how much INRIA papers are linked, downloaded and cited is not 
 necessarily (or even probably) a function of their direct locus!
 
 What is important for INRIA (and all institutions) is that as much as 
 possible of their paper output should be OA, simpliciter, so that it can be 
 linked, downloaded, read, applied, used and cited. It is entirely secondary, 
 for INRIA (and all institutions), *where* their papers are OA, compared to 
 the necessary condition *that* they are OA (and hence freely accessible, 
 usable, harvestable).
 
 Hence (in my view) by far the most important ranking factor for 
 institutional repositories is how much of their full-text institutional 
 paper output is indeed deposited and OA. INRIA would have no reason to be 
 disappointed if the locus from which its content is searched, retrieved and 
 linked is some other, multi-institutional harvester. INRIA still gets the 
 credit and benefits from all the links, downloads and citations of INRIA content!

Re: Ranking Web of Repositories: July 2010 Edition

2010-07-08 Thread Hélène . Bosc
Isidro,
Thank you for your Ranking Web of World Repositories and for informing us 
about the best quality repositories!


Being French, I am delighted to see HAL so well ranked and I take this 
opportunity to congratulate Franck Laloe for having set up such a good 
national repository as well as the CCSD team for continuing to maintain and 
improve it.

Nevertheless, there is a problem in your ranking that I have already had 
occasion to point out to you in private messages.
May I remind you that:

Correction for the top 800 ranking:


The ranking should either index HyperHAL alone, or index both HAL/INRIA and 
HAL/SHS, but not all three repositories at the same time: HyperHAL includes 
both HAL/INRIA and HAL/SHS .

Correction for the ranking of institutional repositories:


Not only does HyperHAL (#1) include both HAL/INRIA (#3) and HAL/SHS (#5), as 
noted above, but HyperHAL is a multidisciplinary repository, intended to 
collect all French research output, across all institutions. Hence it should 
not be classified and ranked against individual institutional repositories 
but as a national, central repository. Indeed, even HAL/SHS is 
multi-institutional in the usual sense of the word: single universities or 
research institutions. The classification is perhaps being misled by the 
polysemous use of the word institution.


Not to seem to be biassed against my homeland, I would also point out that, 
among the top 10 of the top 800 institutional repositories, CERN (#2) is 
to a certain extent hosting multi-institutional output too, and is hence not 
strictly comparable to true single-institution repositories. In addition, 
California Institute of Technology Online Archive of California (#9) is 
misnamed -- it is the Online Archive of California http://www.oac.cdlib.org/ 
(CDLIB, not CalTech) and as such it too is multi-institutional. And Digital 
Library and Archives Virginia Tech University (#4) may also be anomalous, as 
it includes the archives of electronic journals with multi-institutional 
content. Most of the multi-institutional anomalies in the Top 800 
Institutional seem to be among the top 10 -- as one would expect if 
multiple institutional content is inflating the apparent size of a 
repository. Beyond the top 10 or so, the repositories look to be mostly true 
institutional ones.


I hope that this will help in improving the next release of your 
increasingly useful ranking!


Best wishes
Hélène Bosc

- Original Message - 
From: Stevan Harnad har...@ecs.soton.ac.uk
To: american-scientist-open-access-fo...@listserver.sigmaxi.org
Sent: Tuesday, July 06, 2010 6:07 PM
Subject: Fwd: Ranking Web of Repositories: July 2010 Edition



Begin forwarded message:

From: Isidro F. Aguillo isidro.agui...@cchs.csic.es
Date: July 6, 2010 11:13:58 AM EDT
To: sigmetr...@listserv.utk.edu
Subject: [SIGMETRICS] Ranking Web of Repositories: July 2010 Edition

Ranking Web of Repositories: July 2010 Edition

The second edition of 2010 Ranking Web of Repositories has been published 
the same day OR2010 started here in Madrid. The ranking is available from 
the following URL:

http://repositories.webometrics.info/

The main novelty is the substantial increase in the number of repositories 
analyzed (close to 1000). The Top 800 are ranked according to their web 
presence and visibility. As usual, thematic repositories (CiteSeer, RePEc, 
arXiv) lead the Ranking, but the French research institutes (CNRS, INRIA, 
SHS) using HAL are very close.  Two things have changed from previous 
editions from a methodological point of view: the use of Bing's engine 
data has been discarded due to irregularities in the figures obtained, and MS 
Excel files have been excluded again.

At the end of July the new edition of the Rankings of universities, research 
centers and hospitals will be published.

Comments, suggestions and additional information are greatly appreciated.

-- 
===
Isidro F. Aguillo, HonPhD
Cybermetrics Lab (3C1)
IPP-CCHS-CSIC
Albasanz, 26-28
28037 Madrid. Spain

Editor of the Rankings Web
===



Re: Ranking Web of Repositories: July 2010 Edition

2010-07-08 Thread Isidro F . Aguillo
  Dear Hélène:

Thank you for your message, but I disagree with your proposal. We are 
not measuring only contents but contents AND visibility in the web. 
Certainly HyperHAL covers the contents of all its participants, but the 
impact of these contents depends on other factors. Probably researchers 
prefer to link to the paper in INRIA because of the prestige of this 
institution, the affiliation of the author or the marketing of their 
institutional repository.
But here is a more important aspect. If I were the president of INRIA I 
would prefer people to use my institutional repository instead of CCSD. No 
problem with the last one, they are doing a great job and increasing 
the reach of INRIA, but the papers deposited are a very important (the 
most important?) asset of INRIA.

Regarding the other comments we are going to correct those with mistakes 
but it is very difficult for us to realize that Virginia Tech University 
is faking its institutional repository with contents authored by 
external scholars.

Best regards,





On 07/07/2010 23:03, Hélène.Bosc wrote:
 Isidro,
 Thank you for your Ranking Web of World Repositories and for informing 
 us about the best quality repositories!


 Being French, I am delighted to see HAL so well ranked and I take this 
 opportunity to congratulate Franck Laloe for having set up such a good 
 national repository as well as the CCSD team for continuing to 
 maintain and improve it.

 Nevertheless, there is a problem in your ranking that I have already 
 had occasion to point out to you in private messages.
 May I remind you that:

 Correction for the top 800 ranking:


 The ranking should either index HyperHAL alone, or index both 
 HAL/INRIA and HAL/SHS, but not all three repositories at the same 
 time: HyperHAL includes both HAL/INRIA and HAL/SHS .

 Correction for the ranking of institutional repositories:


 Not only does HyperHAL (#1) include both HAL/INRIA (#3) and HAL/SHS 
 (#5), as noted above, but HyperHAL is a multidisciplinary repository, 
 intended to collect all French research output, across all 
 institutions. Hence it should not be classified and ranked against 
 individual institutional repositories but as a national, central 
 repository. Indeed, even HAL/SHS is multi-institutional in the usual 
 sense of the word: single universities or research institutions. The 
 classification is perhaps being misled by the polysemous use of the 
 word institution.


 Not to seem to be biassed against my homeland, I would also point out 
 that, among the top 10 of the top 800 institutional repositories, 
 CERN (#2) is to a certain extent hosting multi-institutional output 
 too, and is hence not strictly comparable to true single-institution 
 repositories. In addition, California Institute of Technology Online 
 Archive of California (#9) is misnamed -- it is the Online Archive of 
 California http://www.oac.cdlib.org/ (CDLIB, not CalTech) and as such 
 it too is multi-institutional. And Digital Library and Archives 
 Virginia Tech University (#4) may also be anomalous, as it includes 
 the archives of electronic journals with multi-institutional content. 
 Most of the multi-institutional anomalies in the Top 800 
 Institutional seem to be among the top 10 -- as one would expect if 
 multiple institutional content is inflating the apparent size of a 
 repository. Beyond the top 10 or so, the repositories look to be 
 mostly true institutional ones.


 I hope that this will help in improving the next release of your 
 increasingly useful ranking!


 Best wishes
 Hélène Bosc

 - Original Message - From: Stevan Harnad 
 har...@ecs.soton.ac.uk
 To: american-scientist-open-access-fo...@listserver.sigmaxi.org
 Sent: Tuesday, July 06, 2010 6:07 PM
 Subject: Fwd: Ranking Web of Repositories: July 2010 Edition



 Begin forwarded message:

 From: Isidro F. Aguillo isidro.agui...@cchs.csic.es
 Date: July 6, 2010 11:13:58 AM EDT
 To: sigmetr...@listserv.utk.edu
 Subject: [SIGMETRICS] Ranking Web of Repositories: July 2010 Edition

 Ranking Web of Repositories: July 2010 Edition

 The second edition of 2010 Ranking Web of Repositories has been 
 published the same day OR2010 started here in Madrid. The ranking is 
 available from the following URL:

 http://repositories.webometrics.info/

 The main novelty is the substantial increase in the number of 
 repositories analyzed (close to 1000). The Top 800 are ranked 
 according to their web presence and visibility. As usual, thematic 
 repositories (CiteSeer, RePEc, arXiv) lead the Ranking, but the 
 French research institutes (CNRS, INRIA, SHS) using HAL are very 
 close.  Two things have changed from previous editions from a 
 methodological point of view: the use of Bing's engine data has been 
 discarded due to irregularities in the figures obtained, and MS Excel 
 files have been excluded again.

 At the end of July the new edition of the Rankings of universities, 
 research centers and hospitals will 

Re: Ranking Web of Repositories: July 2010 Edition

2010-07-08 Thread Armbruster, Chris

Hélène,

Institution is indeed not a very precise concept, but the repository ranking
would not be improved if one were to spend much time trying to decide which
repository is institutional and which is not (e.g. how about also deleting No 10
because it is only a departmental repository?). Also, it is a bad idea to define
repositories as institutional only if they restrict themselves to the output of
a single institution. We already have too many repository managers who succumb
to this kind of institutionalist logic - and reject OA content only because it
is not from their own institution.

The CSIC has a sound methodology for ranking repositories, and it is not their job
to define exclusively what is an IR and what is not. And in cyberspace it is much
more interesting to compare repositories according to the domains and services they
offer...

Moreover, it would help if we could move beyond the often narrow understanding
of what an institutional repository is and what it is not, and acknowledge more
clearly that a strategy of privileging institutional repositories as such has not
helped. The value and sustainability of IRs (individually, as isolated instances,
and if not embedded in a national system) is rather limited for both scholarship
and open access. Hence, it is very welcome that more determined efforts are
underway at building viable networks of research repositories and integrating IRs
into national systems (e.g. Ireland as the latest instance).

For a sustained argument, please see:

Armbruster/Romary (2010) Comparing Repository Types: Challenges and Barriers for
Subject-Based Repositories, Research Repositories, National Repository Systems
and Institutional Repositories in Serving Scholarly Communication. (accepted
for publication in IJDLS)
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1506905

Romary/Armbruster (2010) Beyond Institutional Repositories. IJDLS 1(1)44-61
http://ssrn.com/abstract=1425692

Regards, Chris

-Original Message-
From: American Scientist Open Access Forum on behalf of Hélène.Bosc
Sent: Wed 7/7/2010 23:03
To: american-scientist-open-access-fo...@listserver.sigmaxi.org
Subject: Re: Ranking Web of Repositories: July 2010 Edition

Isidro,
Thank you for your Ranking Web of World Repositories and for informing us
about the best quality repositories!


Being French, I am delighted to see HAL so well ranked and I take this
opportunity to congratulate Franck Laloe for having set up such a good
national repository as well as the CCSD team for continuing to maintain and
improve it.

Nevertheless, there is a problem in your ranking that I have already had
occasion to point out to you in private messages.
May I remind you that:

Correction for the top 800 ranking:


The ranking should either index HyperHAL alone, or index both HAL/INRIA and
HAL/SHS, but not all three repositories at the same time: HyperHAL includes
both HAL/INRIA and HAL/SHS .

Correction for the ranking of institutional repositories:


Not only does HyperHAL (#1) include both HAL/INRIA (#3) and HAL/SHS (#5), as
noted above, but HyperHAL is a multidisciplinary repository, intended to
collect all French research output, across all institutions. Hence it should
not be classified and ranked against individual institutional repositories
but as a national, central repository. Indeed, even HAL/SHS is
multi-institutional in the usual sense of the word: single universities or
research institutions. The classification is perhaps being misled by the
polysemous use of the word institution.


Not to seem to be biassed against my homeland, I would also point out that,
among the top 10 of the top 800 institutional repositories, CERN (#2) is
to a certain extent hosting multi-institutional output too, and is hence not
strictly comparable to true single-institution repositories. In addition,
California Institute of Technology Online Archive of California (#9) is
misnamed -- it is the Online Archive of California http://www.oac.cdlib.org/
(CDLIB, not CalTech) and as such it too is multi-institutional. And Digital
Library and Archives Virginia Tech University (#4) may also be anomalous, as
it includes the archives of electronic journals with multi-institutional
content. Most of the multi-institutional anomalies in the Top 800
Institutional seem to be among the top 10 -- as one would expect if
multiple institutional content is inflating the apparent size of a
repository. Beyond the top 10 or so, the repositories look to be mostly true
institutional ones.


I hope that this will help in improving the next release of your
increasingly useful ranking!


Best wishes
Hélène Bosc

- Original Message -
From: Stevan Harnad har...@ecs.soton.ac.uk
To: american-scientist-open-access-fo...@listserver.sigmaxi.org
Sent: Tuesday, July 06, 2010 6:07 PM
Subject: Fwd: Ranking Web of Repositories: July 2010 Edition



Begin forwarded message:

From: Isidro F. Aguillo isidro.agui...@cchs.csic.es

Re: Ranking Web of Repositories: July 2010 Edition

2010-07-08 Thread Leslie Carr
On 8 Jul 2010, at 09:43, Isidro F. Aguillo wrote:

 Regarding the other comments we are going to correct those with mistakes but 
 it is very difficult for us to realize that Virginia Tech University is 
 faking its institutional repository with contents authored by external 
 scholars.

These (and the HAL-based problems) are interpretive issues that bedevil services 
that analyse repositories.
If you assume that a repository is full of locally-authored research literature 
then you will find all sorts of counter-examples in one area or another. The 
Rich Media criterion goes some way to filtering out non-documents, but 
whether the items are scholarly or local or equivalent to those in other 
repositories is very difficult to ascertain.
--
Les Carr


 
 
 
 On 07/07/2010 23:03, Hélène.Bosc wrote:
 Isidro,
 Thank you for your Ranking Web of World Repositories and for informing us 
 about the best quality repositories!
 
 
 Being French, I am delighted to see HAL so well ranked and I take this 
 opportunity to congratulate Franck Laloe for having set up such a good 
 national repository as well as the CCSD team for continuing to maintain and 
 improve it.
 
 Nevertheless, there is a problem in your ranking that I have already had 
 occasion to point out to you in private messages.
 May I remind you that:
 
 Correction for the top 800 ranking:
 
 
 The ranking should either index HyperHAL alone, or index both HAL/INRIA and 
 HAL/SHS, but not all three repositories at the same time: HyperHAL includes 
 both HAL/INRIA and HAL/SHS .
 
 Correction for the ranking of institutional repositories:
 
 
 Not only does HyperHAL (#1) include both HAL/INRIA (#3) and HAL/SHS (#5), as 
 noted above, but HyperHAL is a multidisciplinary repository, intended to 
 collect all French research output, across all institutions. Hence it should 
 not be classified and ranked against individual institutional repositories 
 but as a national, central repository. Indeed, even HAL/SHS is 
 multi-institutional in the usual sense of the word: single universities or 
 research institutions. The classification is perhaps being misled by the 
 polysemous use of the word institution.
 
 
 Not to seem to be biassed against my homeland, I would also point out that, 
 among the top 10 of the top 800 institutional repositories, CERN (#2) is 
 to a certain extent hosting multi-institutional output too, and is hence not 
 strictly comparable to true single-institution repositories. In addition, 
 California Institute of Technology Online Archive of California (#9) is 
 misnamed -- it is the Online Archive of California http://www.oac.cdlib.org/ 
 (CDLIB, not CalTech) and as such it too is multi-institutional. And Digital 
 Library and Archives Virginia Tech University (#4) may also be anomalous, as 
 it includes the archives of electronic journals with multi-institutional 
 content. Most of the multi-institutional anomalies in the Top 800 
 Institutional seem to be among the top 10 -- as one would expect if 
 multiple institutional content is inflating the apparent size of a 
 repository. Beyond the top 10 or so, the repositories look to be mostly true 
 institutional ones.
 
 
 I hope that this will help in improving the next release of your 
 increasingly useful ranking!
 
 
 Best wishes
 Hélène Bosc
 
 - Original Message - From: Stevan Harnad har...@ecs.soton.ac.uk
 To: american-scientist-open-access-fo...@listserver.sigmaxi.org
 Sent: Tuesday, July 06, 2010 6:07 PM
 Subject: Fwd: Ranking Web of Repositories: July 2010 Edition
 
 
 
 Begin forwarded message:
 
 From: Isidro F. Aguillo isidro.agui...@cchs.csic.es
 Date: July 6, 2010 11:13:58 AM EDT
 To: sigmetr...@listserv.utk.edu
 Subject: [SIGMETRICS] Ranking Web of Repositories: July 2010 Edition
 
 Ranking Web of Repositories: July 2010 Edition
 
 The second edition of 2010 Ranking Web of Repositories has been published 
 the same day OR2010 started here in Madrid. The ranking is available from 
 the following URL:
 
 http://repositories.webometrics.info/
 
 The main novelty is the substantial increase in the number of repositories 
 analyzed (close to 1000). The Top 800 are ranked according to their web 
 presence and visibility. As usual, thematic repositories (CiteSeer, RePEc, 
 arXiv) lead the Ranking, but the French research institutes (CNRS, INRIA, 
 SHS) using HAL are very close.  Two things have changed from previous 
 editions from a methodological point of view: the use of Bing's engine 
 data has been discarded due to irregularities in the figures obtained, and MS 
 Excel files have been excluded again.
 
 At the end of July the new edition of the Rankings of universities, research 
 centers and hospitals will be published.
 
 Comments, suggestions and additional information are greatly appreciated.
 
 
 
 -- 
 ===
 
 Isidro F. Aguillo, HonPhD
 Cybermetrics Lab (3C1)
 IPP-CCHS-CSIC
 Albasanz, 26-28
 28037 Madrid. Spain
 
 
 Editor of the Rankings Web

Re: Ranking Web of Repositories: July 2010 Edition

2010-07-08 Thread Stevan Harnad
On 2010-07-08, at 4:43 AM, Isidro F. Aguillo wrote:

 Dear Hélène:
 
 Thank you for your message, but I disagree with your proposal. We are not 
 measuring only contents but contents AND visibility on the web.

Dear Isidro,

If I may intervene with some comments too, as this discussion has some wider 
implications:

Yes, you are measuring both contents and visibility, but presumably you want 
the difference between (1) the ranking of the top 800 repositories and (2) the 
ranking of the top 800 *institutional* repositories to be based on the fact 
that the latter are institutional repositories whereas the former are all 
repositories (central, i.e., multi-institutional, as well as institutional).

Moreover, if you list redundant repositories (some being proper subsets of 
others) in the very same ranking, it seems to me that the meaning of the 
ranking becomes rather vague. 

 Certainly HyperHAL covers the contents of all its participants, but the 
 impact of these contents depends on other factors. Probably researchers 
 prefer to link to the paper in INRIA because of the prestige of this 
 institution, the affiliation of the author or the marketing of their 
 institutional repository.

All true, but perhaps the significance and usefulness of the rankings would be 
greater if you either changed the weights of the factors (volume of full-text 
content, number of links) or, alternatively, designed the rankings so that the 
user could select and weight the criteria on which the rankings are displayed.

Otherwise your weightings become like the h-index -- an a-priori combination 
of untested, unvalidated weights that many users may not be satisfied with, or 
fully informed by...

 But here is a more important aspect. If I were the president of INRIA I would 
 prefer people to use my institutional repository instead of CCSD. No problem with 
 the latter; they are doing a great job and increasing the reach of INRIA, 
 but the papers deposited are a very important (the most important?) asset of 
 INRIA.

But how heavily INRIA's papers are linked, downloaded and cited is not necessarily 
(or even probably) a function of their direct locus! 

What is important for INRIA (and all institutions) is that as much as possible 
of their paper output should be OA, simpliciter, so that it can be linked, 
downloaded, read, applied, used and cited. It is entirely secondary, for INRIA 
(and all institutions), *where* their papers are OA, compared to the necessary 
condition *that* they are OA (and hence freely accessible, usable, 
harvestable). 

Hence (in my view) by far the most important ranking factor for institutional 
repositories is how much of their full-text institutional paper output is 
indeed deposited and OA. INRIA would have no reason to be disappointed if the 
locus from which its content is searched, retrieved and linked is some other, 
multi-institutional harvester. INRIA still gets the credit and benefits from 
all the links, downloads and citations of INRIA content!

(Having said that, locus of deposit *does* matter, very much, for deposit 
mandates. Deposit mandates are necessary in order to generate OA content. And, 
for strategic reasons that are elaborated in my reply to Chris Armbruster, it 
makes a big practical difference for success in agreeing on the adoption of a 
mandate that both institutional and funder mandates should require convergent 
*institutional* deposit, rather than divergent and competing institutional vs. 
institution-external deposit.) Here too, your repository rankings would be much 
more helpful and informative if they gave a greater weight to the relative size 
of each institutional repository's content and eliminated multi-institutional 
repositories from the institutional repository rankings -- or at least allowed 
institutional repositories to be ranked independently on content vs. links.
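
By way of illustration only (invented figures, not real repository data), such
a content-relative indicator could be as simple as the deposited share of each
institution's annual paper output:

    # Hypothetical sketch only -- invented figures. It ranks institutional
    # repositories by the share of the institution's annual article output
    # deposited as OA full text, rather than by absolute size or inlinks.

    institutions = [
        # (name, OA full-text deposits this year, total articles this year)
        ("University A", 1800, 2000),
        ("University B", 2500, 6000),
    ]

    for name, deposited, output in sorted(
            institutions, key=lambda inst: inst[1] / inst[2], reverse=True):
        share = deposited / output
        print(f"{name}: {share:.0%} of annual output deposited as OA full text")

    # University A: 90% of annual output deposited as OA full text
    # University B: 42% of annual output deposited as OA full text

On such a measure a small institution that deposits most of its output outranks
a large one that deposits little -- which is exactly the incentive a deposit
mandate is meant to create.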

I think you are perhaps being misled here by the analogy with your sister 
rankings, the RWWU (http://www.webometrics.info/), which rank universities rather 
than their repositories. In university rankings, the links to the university site itself 
matter a lot. But in repository rankings links matter much less than *how much 
institutional content is accessible*. For the degree of usage of that content, 
harvester sites may be more relevant measures, and, after all, downloads and 
citations, unlike links, carry their credits (to the authors and institutions) 
with them no matter where the transaction happens to occur...

 Regarding the other comments, we are going to correct those with mistakes, but 
 it is very difficult for us to realize that Virginia Tech University is 
 faking its institutional repository with contents authored by external 
 scholars.

I have called Gail McMillan at Virginia Tech about this, and she has explained 
it to me. The question was never whether Virginia Tech was faking! They 
simply host content over and above Virginia Tech content -- for example, OA 
journals whose content originates 

Re: Ranking Web of Repositories: July 2010 Edition

2010-07-08 Thread Hélène . Bosc
Dear Chris Armbruster and Isidro Aguilo,
Since Stevan Harnad has the advantage of being able to read and respond to
messages first, I have nothing further to add. Had I replied first, I would have
made some of the arguments he made in support of my view on the rankings, but
it would have been done in a less clear and complete way.
 
Best wishes.
Hélène Bosc
  - Original Message -
From: Armbruster, Chris
To: american-scientist-open-access-fo...@listserver.sigmaxi.org
Sent: Thursday, July 08, 2010 11:44 AM
Subject: Re: Ranking Web of Repositories: July 2010 Edition

Hélène,

Institution is indeed not a very precise concept, but the repository
ranking would not be improved by spending much time trying to
decide which repository is institutional and which is not (e.g. how about
also deleting No 10 because it is only a departmental repository?). Also,
it is a bad idea to define repositories as institutional only if they
restrict themselves to the output of a single institution. We already have
too many repository managers who succumb to this kind of institutionalist
logic - and reject OA content only because it is not from their own
institution.

The CSIC has a sound methodology for ranking repositories, and it is not
their job to define exclusively what is an IR and what is not. And in
cyberspace it is much more interesting to compare repositories according
to the domains and services they offer...

Moreover, it would help if we could move beyond the often narrow
understanding of what an institutional repository is and what it is not, and
acknowledge more clearly that a strategy of privileging institutional
repositories as such has not helped. The value and sustainability of IRs
(individually, as isolated instances, if not embedded in a national
system) is rather limited for both scholarship and open access. Hence, it
is very welcome that more determined efforts are underway at building
viable networks of research repositories and integrating IRs into national
systems (e.g. Ireland as the latest instance).

For a sustained argument, please see:

Armbruster/Romary (2010) Comparing Repository Types: Challenges and
Barriers for Subject-Based Repositories, Research Repositories, National
Repository Systems and Institutional Repositories in Serving Scholarly
Communication. (accepted for publication in IJDLS)
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1506905

Romary/Armbruster (2010) Beyond Institutional Repositories. IJDLS
1(1): 44-61
http://ssrn.com/abstract=1425692

Regards, Chris

-Original Message-
From: American Scientist Open Access Forum on behalf of Hélène.Bosc
Sent: Wed 7/7/2010 23:03
To: american-scientist-open-access-fo...@listserver.sigmaxi.org
Subject: Re: Ranking Web of Repositories: July 2010 Edition

Isidro,
Thank you for your Ranking Web of World Repositories and for informing us
about the best quality repositories!


Being French, I am delighted to see HAL so well ranked and I take this
opportunity to congratulate Franck Laloe for having set up such a good
national repository as well as the CCSD team for continuing to maintain and
improve it.

Nevertheless, there is a problem in your ranking that I have already had
occasion to point out to you in private messages.
May I remind you that:

Correction for the top 800 ranking:


The ranking should either index HyperHAL alone, or index both HAL/INRIA and
HAL/SHS, but not all three repositories at the same time: HyperHAL includes
both HAL/INRIA and HAL/SHS.

Correction for the ranking of institutional repositories:


Not only does HyperHAL (#1) include both HAL/INRIA (#3) and HAL/SHS (#5), as
noted above, but HyperHAL is a multidisciplinary repository, intended to
collect all French research output, across all institutions. Hence it should
not be classified and ranked against individual institutional repositories
but as a national, central repository. Indeed, even HAL/SHS is
multi-institutional in the usual sense of the word "institution": it spans
many single universities and research institutions. The classification is
perhaps being misled by the polysemous use of the word "institution".


Not to seem biased against my homeland, I would also point out that,
among the top 10 of the top 800 institutional repositories, CERN (#2) is
to a certain extent hosting multi-institutional output too, and is hence not
strictly comparable to true single-institution repositories. In addition,
California Institute of Technology Online Archive of California (#9) is
misnamed -- it is the Online Archive of California http://www.oac.cdlib.org/
(CDLIB, not CalTech) and as such it too is multi-institutional. And Digital
Library and Archives Virginia Tech University (#4) may also be anomalous, as
it includes the archives of electronic journals with multi-institutional
content. Most of the multi-institutional anomalies in the Top 800
Institutional seem to be among the top 10 -- as one would expect if
multiple institutional content is inflating the apparent size