Anthony

Since DSpace 4 you can filter by userAgent; see
https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-FilteringandPruningSpiders
I have not used this myself and am not sure whether these filters are applied
as crawlers access content, or whether you need to run the
[dspace]/bin/dspace stats-util command on a regular basis. You definitely need
to run it to prune or mark usage events after you configure
a list of userAgents you want to filter against.
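
Before settling on a list of userAgents, it may help to see which agents account for most of the recorded events. The following is only a rough sketch, not something I have run against a real repository: it assumes the statistics core stores a `userAgent` field and that you ran a select query with facet=true, facet.field=userAgent and wt=json. The function name `agent_counts` and the sample response are made up for illustration.

```python
import json

def agent_counts(response_text):
    """Turn Solr's flat facet list [value, count, value, count, ...] into
    (agent, count) pairs, highest count first."""
    data = json.loads(response_text)
    # Assumption: the query faceted on a field named "userAgent".
    flat = data["facet_counts"]["facet_fields"]["userAgent"]
    pairs = list(zip(flat[0::2], flat[1::2]))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Made-up response body for illustration only, not real statistics.
sample = json.dumps({
    "facet_counts": {
        "facet_fields": {
            "userAgent": ["Mozilla/5.0", 45, "ExampleBot/1.0", 120]
        }
    }
})

for agent, count in agent_counts(sample):
    print(count, agent)
```

Agents that show up near the top and are clearly crawlers are candidates for the filter list.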

Monika

________________
Monika Mevenkamp
phone: 609-258-4161
Princeton University, Princeton, NJ 08544


On May 12, 2015, at 2:13 PM, Anthony Petryk
<anthony.pet...@uottawa.ca> wrote:

After a bit of investigation, it turns out that a significant portion of our 
items stats come from spiders.  Any thoughts on the best way to go about 
removing them from Solr retroactively?  There’s nothing that I can see in the 
code that will do this by domain or agent, only IP.  We’re not excited at the 
prospect of pulling out the IPs of all the spiders in order to run “stats-util -i” 
effectively.

Cheers,

Anthony

From: Monika C. Mevenkamp [mailto:moni...@princeton.edu]
Sent: Friday, May 08, 2015 9:59 AM
To: Anthony Petryk
Cc: dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] spider ip recognition

Anthony

I wrote a small ruby script to put solr queries together when I was poking 
around my stats

see https://github.com/akinom/dscriptor/blob/master/solr/solr_query.rb
an example parameter file is 
https://github.com/akinom/dscriptor/blob/master/solr/solr_query.yml

run it as    ruby solr/solr_query.rb

Of course you need to adjust the parameters in the yml file

you can query like this

http://localhost:YOUR-PORT/solr/statistics/select?wt=json&indent=true&rows=1&facet=true&facet.field=ip&facet.mincount=1&q=type:2+id:218+isBot:false

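The parts of this query do the following:
isBot:false - exclude records that are marked as bots
type:2 - match items only
id:218 - match the item with id 218
rows=1 - return one matching document
facet.field=ip - facet on IP addresses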

crank up the number of rows to get more matching docs
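
If you would rather not edit the URL by hand, the same query can be put together programmatically. This is a minimal sketch using only the Python standard library; the function name `item_stats_url` and its parameters are made up for illustration, and port 8080 in the example call is an assumption, so use whatever port your Solr runs on.

```python
from urllib.parse import urlencode

def item_stats_url(item_id, port, rows=1, include_bots=False):
    """Build a select URL for the usage events of one item in the statistics core."""
    terms = ["type:2", f"id:{item_id}"]   # type:2 means items
    if not include_bots:
        terms.append("isBot:false")       # skip records marked as bots
    params = [
        ("wt", "json"),
        ("indent", "true"),
        ("rows", rows),                   # number of matching docs to return
        ("facet", "true"),
        ("facet.field", "ip"),            # facet on IP addresses
        ("facet.mincount", 1),
        ("q", " ".join(terms)),           # spaces are encoded as '+'
    ]
    return f"http://localhost:{port}/solr/statistics/select?" + urlencode(params)

# Port 8080 is an assumption for this example.
print(item_stats_url(218, port=8080))
```

Pass a larger rows value to get more matching docs, or include_bots=True to compare the counts with and without the records marked as bots.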

Monika





On May 7, 2015, at 3:26 PM, Anthony Petryk
<anthony.pet...@uottawa.ca> wrote:

Anyway, we want to determine whether these stats are bona fide or whether 
there's something wrong with the spider detection.  From the documentation it 
seems we have to query Solr directly to do this.  Not being an expert in Solr, 
I'm hoping someone on this list could provide the query that retrieves *all the 
stats for a given item* (i.e. what's listed under "Common stored fields for all 
usage events" in the documentation).

_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
