Hi all,

We noticed something disturbing over the weekend: many hits from Google Scholar are landing on "extracted text" bitstreams in our repository instead of on the PDF articles themselves. (Search your logs for "pdf.txt" if you want to check whether this is happening to you.)
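For anyone who wants to run that check, here is a rough Python sketch. It assumes your web server writes the Apache/nginx "combined" log format and that extracted-text bitstream URLs end in ".pdf.txt" (possibly followed by a query string such as "?sequence=2"); the function names and the three source labels are made up for illustration, so adjust them to your setup.

```python
import re
from collections import Counter

# Assumption: "combined" log format, i.e.
# ip ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def find_extracted_text_hits(lines):
    """Return (source, path) pairs for requests to *.pdf.txt bitstreams.

    source is a hypothetical label: "scholar-referral" when the referer
    mentions scholar.google, "googlebot-crawl" when the user agent
    mentions Googlebot, and "other" for everything else.
    """
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        # Ignore the query string when testing the file extension.
        bare_path = match.group("path").split("?", 1)[0]
        if not bare_path.lower().endswith(".pdf.txt"):
            continue
        referer = match.group("referer").lower()
        agent = match.group("agent").lower()
        if "scholar.google" in referer:
            source = "scholar-referral"
        elif "googlebot" in agent:
            source = "googlebot-crawl"
        else:
            source = "other"
        hits.append((source, match.group("path")))
    return hits


def summarize(hits):
    """Count hits per source label."""
    return Counter(source for source, _ in hits)
```

To use it, open your access log and pass the file object to find_extracted_text_hits, then print summarize(hits). A non-zero "scholar-referral" count would suggest that Scholar users are being sent to the text files, and "googlebot-crawl" that the crawler is fetching them directly.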
It looks like our repository isn't the only one affected. Take a look at the following query and notice the "TXT at..." links on the right side:

http://scholar.google.com/scholar?hl=en&q=shieber+violently+broccoli&btnG=&as_sdt=1%2C22&as_sdtp=

There are three links to three separate repositories, all exhibiting the same behavior (links to pdf.txt extracted-text bitstreams).

From examining my logs, it looks like Google has recently started crawling the METS and ORE OAI-PMH crosswalks, which expose links to the extracted-text bitstreams. The first such crawl in our logs is from September 17th, though we had these crosswalks enabled long before that. Google seems to be using the extracted-text links in preference to the PDF articles proper. Fortunately, Googlebot isn't sending follow-up requests with resumptionTokens, so only the first 100 records in each collection seem to have been mis-crawled this way.

We turned off the METS and ORE crosswalks this morning and hope this will convince Google Scholar to stop indexing the extracted-text bitstreams, but obviously it's too soon to tell.

Has anyone else noticed this issue, or found a solution that works? It seems a little strange to me that these files are publicly visible at all. Aren't they just for building the internal DSpace full-text search index?

Reinhard

_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech