Hi all,

We noticed something disturbing over the weekend: many hits from
Google Scholar are landing on "extracted text" bitstreams in our
repository instead of the PDF articles themselves (search your logs for
"pdf.txt" if you want to check whether this is happening to you).
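
For anyone who wants more than a grep for "pdf.txt", here is a minimal
sketch of that log check in Python. It assumes your web server writes
Apache-style Combined Log Format (so the referrer is on each line); the
function name and the counter labels are made up for illustration, and
you would need to adjust the pattern for any other log layout.

```python
import re
from collections import Counter

# Assumes Combined Log Format:
#   ... "METHOD path PROTOCOL" status size "referrer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def count_extracted_text_hits(lines):
    """Count requests for extracted-text bitstreams (paths containing
    "pdf.txt"), split by whether the referrer is Google Scholar."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "pdf.txt" not in match.group("path"):
            continue  # unparseable line, or not an extracted-text bitstream
        if "scholar.google" in match.group("referrer"):
            counts["from_google_scholar"] += 1
        else:
            counts["other"] += 1
    return counts
```

You could feed it an open log file directly, e.g.
count_extracted_text_hits(open(path_to_your_access_log)), where the path
is whatever your own setup uses.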

And it looks like our repository isn't the only one affected by this.

Take a look at the following query and notice the "TXT at..." links on
the right side.

http://scholar.google.com/scholar?hl=en&q=shieber+violently+broccoli&btnG=&as_sdt=1%2C22&as_sdtp=

There are three links to three separate repos which all exhibit the
same behavior (links to pdf.txt extracted text bitstreams).

From examining my logs, it looks like Google has recently started
crawling the METS and ORE OAI-PMH crosswalks, which expose links to the
extracted text bitstreams (our first record of such a crawl is
September 17th, though we had these crosswalks enabled long before
that). It seems to be using the "extracted text" links in preference to
the PDF articles proper. Fortunately, Googlebot isn't sending follow-up
requests with resumptionTokens, so only the first 100 records in each
collection seem to have been mis-crawled this way.
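
To check your own logs for the same crawl pattern, something along these
lines might help. It is only a sketch: it assumes the crosswalks are
exposed under the metadataPrefix values "mets" and "ore" and that the
crawler identifies itself with "Googlebot" in the user agent; the
function name and return labels are hypothetical. Note that in OAI-PMH a
resumptionToken is an exclusive argument, so a resumed request carries
no metadataPrefix and cannot be attributed to a crosswalk from the URL
alone.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed metadataPrefix names for the METS and ORE crosswalks.
CROSSWALK_PREFIXES = {"mets", "ore"}

def classify_oai_request(url, user_agent):
    """Classify a logged request URL.

    Returns "first_page" for a Googlebot ListRecords request on a
    METS/ORE crosswalk, "resumed" for a Googlebot ListRecords request
    carrying a resumptionToken, and None for anything else.
    """
    if "Googlebot" not in user_agent:
        return None
    params = parse_qs(urlsplit(url).query)
    if params.get("verb") != ["ListRecords"]:
        return None
    if "resumptionToken" in params:
        # Follow-up page; the prefix is hidden inside the token.
        return "resumed"
    prefix = params.get("metadataPrefix", [""])[0].lower()
    return "first_page" if prefix in CROSSWALK_PREFIXES else None
```

If every matching request comes back as "first_page" and none as
"resumed", that is consistent with only the first batch of records per
collection having been harvested.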

We turned off the METS and ORE crosswalks this morning and hope this
will convince Google Scholar not to index extracted PDF text bitstreams
any more, but obviously it's too soon to tell. Has anyone else noticed
this issue or found a solution that works?

It seems a little strange to me that these files should be publicly
visible at all -- aren't they just for building the internal DSpace
full-text search index?

Reinhard

_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
