We're having a similar problem (DSpace 4.3).

Here's an example: http://www.ruor.uottawa.ca/handle/10393/23938/statistics.  
These stats seem high for this item (although we could be wrong about that), 
but what's more peculiar is all the traffic from China.  We've run through all 
the options for the "stats-util" command but the numbers remain the same.  Note 
that when we run "stats-util -u" (update the IP lists), we always get the 
message "Not modified - so not downloaded".  I'm guessing that the spider lists 
by domain name or user-agent are more current?  

Anyway, we want to determine whether these stats are bona fide or whether 
there's something wrong with the spider detection.  From the documentation it 
seems we have to query Solr directly to do this.  Not being an expert in Solr, 
I'm hoping someone on this list could provide the query that retrieves *all the 
stats for a given item* (i.e. what's listed under "Common stored fields for all 
usage events" in the documentation).
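
For reference, the kind of query being asked about might look like the sketch below. It only builds the request URL for Solr's standard select handler. The Solr location (localhost:8080/solr/statistics), the helper name item_stats_url, and the example item ID are assumptions for illustration; the field names (type, id, isBot, time) are taken from the DSpace statistics documentation as I understand it, where type 2 denotes an item and id is the item's internal database ID rather than its handle.

```python
from urllib.parse import urlencode

# Assumed location of the DSpace statistics core; adjust for your install.
SOLR_SELECT = "http://localhost:8080/solr/statistics/select"


def item_stats_url(item_id: int, rows: int = 100, include_bots: bool = True) -> str:
    """Build a Solr select URL returning the stored usage events for one item.

    item_id is assumed to be the item's internal database ID, not its handle.
    """
    params = [
        ("q", f"type:2 AND id:{item_id}"),  # type 2 = item (assumed constant)
        ("rows", str(rows)),                # number of usage events to return
        ("sort", "time desc"),              # newest events first
        ("wt", "json"),                     # ask Solr for a JSON response
    ]
    if not include_bots:
        # Filter out events already flagged as spider traffic.
        params.append(("fq", "-isBot:true"))
    return SOLR_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    # 1234 is a made-up item ID, used only to show the resulting URL.
    print(item_stats_url(1234))
```

Pasting the printed URL into a browser on the server (or fetching it with curl) should list each stored event with its ip, userAgent, countryCode and isBot values, which is what is needed to judge whether the spider detection is working.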

Thanks for your time,

Anthony

Anthony Petryk
Emerging Technologies Librarian | Bibliothécaire des technologies émergentes
uOttawa Library | Bibliothèque uOttawa
613-562-5800 x4650
apet...@uottawa.ca


-----Original Message-----
From: Mark H. Wood [mailto:mw...@iupui.edu] 
Sent: Friday, April 24, 2015 9:42 AM
To: dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] spider ip recognition

On Thu, Apr 23, 2015 at 05:39:01PM +0000, Monika C. Mevenkamp wrote:
> I found a couple of really suspicious numbers in my solr stats, i.e. lots of 
> entries were marked as isBot=false although they probably should have been 
> isBot=true.
> 
> In the config file  I use
> 
> spiderips.urls = http://iplists.com/google.txt, \
>                  http://iplists.com/inktomi.txt, \
>                  http://iplists.com/lycos.txt, \
>                  http://iplists.com/infoseek.txt, \
>                  http://iplists.com/altavista.txt, \
>                  http://iplists.com/excite.txt, \
>                  http://iplists.com/northernlight.txt, \
>                  http://iplists.com/misc.txt, \
>                  http://iplists.com/non_engines.txt
> 
> 
> I could not find downloadable lists for Bing, Baidu, Yahoo.
> The best I saw was:   
> http://myip.ms/info/bots/Google_Bing_Yahoo_Facebook_etc_Bot_IP_Addresses.html
> Is that reliable  ?
> 
> Does anybody out there have lists / sources that they can share ?

What version of DSpace are you running?  Recent versions can also recognize 
spiders by regular expression matching of the domain name or UserAgent: 
string.  (However, that only works for new entries.  I've recently found that 
some of the tools for loading and grooming the statistics core don't use 
SpiderDetector and are oblivious of these newer patterns.)

> Also: does the dspace code gracefully deal with IP address patterns ?

That depends on what is considered graceful.  The code (in
org.dspace.statistics.util.IPTable) accepts patterns in three forms:

  11.22.33.44-11.22.33.55
  11.22.33.44
  11.22.33

Addresses in the first form may be suffixed with a CIDR mask-length, but it is 
currently ignored.

If I've understood the code, a range (the first form) is assumed to differ only 
in the fourth octet.  It will match all addresses between "44" and "55" within 
the /24 containing the start of the range.

The second form is an exact match of a single address.

The third form is a match of the first 24 bits -- an entire /24 (what used to 
be called a Class C network).

There is no provision for IPv6.
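
To make the three forms concrete, here is a minimal Python sketch of the matching rules as described above. It is not the IPTable code itself, only an illustration of my reading of it; the function name matches is hypothetical, and stripping a "/nn" suffix from range endpoints is an assumption about how the ignored mask-length would appear.

```python
def matches(pattern: str, ip: str) -> bool:
    """Return True if the IPv4 address `ip` matches `pattern`.

    Illustrates the three pattern forms described above; this is a sketch,
    not the actual org.dspace.statistics.util.IPTable implementation.
    """
    octets = ip.split(".")
    if len(octets) != 4:
        return False  # IPv6 or malformed addresses never match

    if "-" in pattern:
        # Form 1: a range. Any "/nn" mask-length suffix is dropped (assumed
        # syntax), mirroring the "currently ignored" behaviour.
        start, end = (p.strip().split("/")[0] for p in pattern.split("-", 1))
        start_octets, end_octets = start.split("."), end.split(".")
        # Only the fourth octet varies; the first three come from the start.
        if octets[:3] != start_octets[:3]:
            return False
        return int(start_octets[3]) <= int(octets[3]) <= int(end_octets[3])

    parts = pattern.split(".")
    if len(parts) == 4:
        return octets == parts        # Form 2: exact single address
    if len(parts) == 3:
        return octets[:3] == parts    # Form 3: first 24 bits (a whole /24)
    return False
```

With these rules, "11.22.33.44-11.22.33.55" matches 11.22.33.50 but not 11.22.33.60 or 11.22.34.50, and "11.22.33" matches any address from 11.22.33.0 through 11.22.33.255.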

--
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu
