Hi, you've run into a known issue, and one I very recently wrestled with myself:
https://jira.duraspace.org/browse/DS-2431
See my last comment on that ticket: I found a way around the issue by simply
deleting the spider docs from the stats index via a query in the Solr admin
interface.
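For reference, the fix amounts to sending a delete-by-query to the statistics core, e.g. with curl (the core URL and the example query, an IP wildcard, are placeholders to adjust to whatever actually identifies the spider traffic in your index):

curl 'http://localhost:8080/solr/statistics/update?commit=true' \
  -H 'Content-Type: text/xml' \
  --data-binary '<delete><query>ip:157.55.39.*</query></delete>'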
--Hardy
________________________________
From: Anthony Petryk [anthony.pet...@uottawa.ca]
Sent: Thursday, May 14, 2015 12:06 PM
To: Monika C. Mevenkamp; dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] spider ip recognition
Hi again,
Unfortunately, the documentation for the stats-util command is incorrect.
Specifically, this line:
-i or --delete-spiders-by-ip: Delete Spiders in Solr By IP Address, DNS name,
or Agent name. Will prune out all records that match spider identification
patterns.
Running “stats-util -i” does not actually remove spiders by DNS name or Agent
name. Here are the relevant sections of the code, from StatisticsClient.java
and SolrLogger.java:
(…)
else if(line.hasOption('i'))
{
    SolrLogger.deleteRobotsByIP();
}

public static void deleteRobotsByIP()
{
    for(String ip : SpiderDetector.getSpiderIpAddresses()){
        deleteIP(ip);
    }
}
What this means is that, if a spider is in your Solr stats, there’s no way to
remove it other than manually adding its IP to [dspace]/config/spiders; adding
its DNS name or Agent name to the configs will not expunge it. Updating the
spider files with “stats-util -u” does little to help because the IP lists it
pulls from are out of date.
An example is the spider from the Bing search engine: bingbot. As of DSpace
4.3, it’s not in the list of spiders by DNS name or Agent name, nor is it in
the list of spider IP addresses. So anyone running DSpace 4.3 likely has usage
stats inflated by visits from this spider. The only way to remove it is to
specify all the IPs for bingbot. Multiply that by all the other “new” spiders
and we’re talking about a lot of work.
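To illustrate the manual route: as far as I can tell from the bundled files, each file under [dspace]/config/spiders is just a plain-text list of addresses, one per line, so covering a “new” spider means maintaining something like a hypothetical bingbot-manual.txt by hand (addresses below are only placeholders):

157.55.39.1
157.55.39.2
157.55.39.3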
I tried briefly to modify the code to take domains/agents into account when
marking or deleting spiders, but I wasn’t able to figure out how to query Solr
with regex patterns. It’s easier to do with IPs because each IP or IP range is
transformed into a String and used as a standard query parameter.
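For what it’s worth, the direction I was aiming for looks roughly like the standalone SolrJ sketch below, rather than an actual patch to SolrLogger. The core URL, the userAgent field name and the regex patterns are my assumptions, and I haven’t verified the regex matching against our schema:

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DeleteSpidersByAgent
{
    public static void main(String[] args) throws SolrServerException, IOException
    {
        // Hypothetical agent patterns; in a real patch these would come from
        // the pattern files in [dspace]/config/spiders/agents.
        String[] agentPatterns = { ".*[Bb]ingbot.*", ".*Googlebot.*" };

        HttpSolrServer solr = new HttpSolrServer("http://localhost:8080/solr/statistics");
        for (String pattern : agentPatterns)
        {
            // Lucene regex syntax (Solr 4+); on an untokenized string field the
            // pattern has to match the entire stored userAgent value.
            solr.deleteByQuery("userAgent:/" + pattern + "/");
        }
        solr.commit();
        solr.shutdown();
    }
}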
Anthony
From: Monika C. Mevenkamp [mailto:moni...@princeton.edu]
Sent: Thursday, May 14, 2015 11:17 AM
To: Anthony Petryk
Cc: Monika C. Mevenkamp; dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] spider ip recognition
Anthony
Since DSpace 4 you can filter by userAgent
see
https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-FilteringandPruningSpiders
I have not used this myself and am not sure whether these filters are applied
as crawlers access content, or whether you need to run the
[dspace]/bin/dspace stats-util command on a regular basis. You definitely need
to run it to prune or mark usage events after you configure
a list of userAgents you want to filter against.
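If I read that page correctly, the maintenance pass would be along these lines (option letters as documented there; I have not run this myself):

[dspace]/bin/dspace stats-util -u   (update the spider IP files from the configured lists)
[dspace]/bin/dspace stats-util -m   (mark matching hits in Solr as bots)
[dspace]/bin/dspace stats-util -f   (delete the records that have been flagged)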
Monika
________________
Monika Mevenkamp
phone: 609-258-4161
Princeton University, Princeton, NJ 08544
On May 12, 2015, at 2:13 PM, Anthony Petryk
<anthony.pet...@uottawa.ca> wrote:
After a bit of investigation, it turns out that a significant portion of our
items stats come from spiders. Any thoughts on the best way to go about
removing them from Solr retroactively? There’s nothing that I can see in the
code that will do this by domain or agent, only IP. We’re not excited at the
prospect of pulling out the IPs of all the spiders in order to run “stats-util -i”
effectively.
Cheers,
Anthony
From: Monika C. Mevenkamp [mailto:moni...@princeton.edu]
Sent: Friday, May 08, 2015 9:59 AM
To: Anthony Petryk
Cc: dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] spider ip recognition
Anthony
I wrote a small ruby script to put solr queries together when I was poking
around my stats
see https://github.com/akinom/dscriptor/blob/master/solr/solr_query.rb
an example parameter file is
https://github.com/akinom/dscriptor/blob/master/solr/solr_query.yml
run it as ruby solr/solr_query.rb
of course you need to adjust the parameters in the yml file
you can query like this
http://localhost:YOUR-PORT/solr/statistics/select?wt=json&indent=true&rows=1&facet=true&facet.field=ip&facet.mincount=1&q=type:2+id:218+isBot:false
isBot:false excludes records that are marked as bots
type:2 restricts to items
id:218 restricts to the item with id 218
rows=1 returns a single matching document
facet.field=ip facets on the ip addresses
crank up the number of rows to get more matching docs
Monika
________________
Monika Mevenkamp
phone: 609-258-4161
Princeton University, Princeton, NJ 08544
On May 7, 2015, at 3:26 PM, Anthony Petryk
<anthony.pet...@uottawa.ca> wrote:
Anyway, we want to determine whether these stats are bona fide or whether
there's something wrong with the spider detection. From the documentation it
seems we have to query Solr directly to do this. Not being an expert in Solr,
I'm hoping someone on this list could provide the query that retrieves *all the
stats for a given item* (i.e. what's listed under "Common stored fields for all
usage events" in the documentation).
_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette