Hi Susan,
The one that DSpace uses is http://iplists.com. It was last updated 2
years ago. I haven’t come across another one myself, at least not in such
an easy-to-use format. We’ve taken to periodically removing the main
offenders by hand (facet your Solr query by IP; the top ones will likely be
bots). A more up-to-date list would be welcome indeed!
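
For what it's worth, here is a rough sketch of the kind of facet query I mean. It only builds the request URL and does not contact Solr. The base URL (http://localhost:8080/solr/statistics) and the class and method names are assumptions for illustration; "ip" is assumed to be the field holding the client address in the statistics core, so adjust both to your installation.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of DSpace.
class TopIpFacetQuery {

    // Builds a Solr select URL that returns no documents, only facet counts
    // for the busiest values of the (assumed) "ip" field.
    static String buildUrl(String solrBase, int topN) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "*:*");                 // match every usage event
        params.put("rows", "0");                // we only want the facet counts
        params.put("facet", "true");
        params.put("facet.field", "ip");        // assumed field name
        params.put("facet.limit", String.valueOf(topN));
        params.put("facet.mincount", "1");
        params.put("wt", "json");

        String query = params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        return solrBase + "/select?" + query;
    }

    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The base URL below is an assumption; use your own Solr location.
        System.out.println(buildUrl("http://localhost:8080/solr/statistics", 20));
    }
}
```

Pasting the printed URL into a browser (or the Solr admin query page) should list the IPs with the most hits first, which is where we usually spot the bots.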
Anthony
On Wednesday, February 3, 2016 at 5:22:07 PM UTC-5, Susan Borda wrote:
>
> Hi-
> Is there a reputable list of IPs or Agents? I have some download numbers
> that seem way too high.
>
> Filters I’m using in Solr: isBot:False, -dns:*bot*, -dns:*spider*. Also we
> have spider IP text files in [dspace]/config/spiders
>
> Thanks,
> s
> —
> Susan Borda
> Digital Technologies Development Librarian
> Montana State University Library
> 406-994-1873
>
>
>
> From: "Pottinger, Hardy J."
> Date: Friday, May 15, 2015 at 7:19 AM
> To: Anthony Petryk, "Monika C. Mevenkamp",
> dspac...@lists.sourceforge.net
> Subject: Re: [Dspace-tech] spider ip recognition
>
> Hi, you've run into a known issue, and one I very recently wrestled with
> myself:
>
> https://jira.duraspace.org/browse/DS-2431
>
> See my last comment on that ticket: I found a way around the issue by
> simply deleting the spider docs from the stats index via a query in the
> Solr admin interface.
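
A minimal sketch of what such a delete-by-query request body can look like, in case it helps others reading this thread. Solr's XML update format wraps the query in delete and query elements. The query string "isBot:true" is only an assumption about which documents to target (it is not necessarily the query used on the ticket), and the class and method names are hypothetical. Back up the statistics core before sending anything like this.

```java
// Hypothetical helper, not part of DSpace or SolrJ.
class SolrDeleteByQuery {

    // Escapes the characters that would otherwise break the XML body.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Builds the XML message that Solr's update handler accepts
    // for deleting every document matching the given query.
    static String deleteBody(String query) {
        return "<delete><query>" + escapeXml(query) + "</query></delete>";
    }

    public static void main(String[] args) {
        // "isBot:true" is an assumed example query; substitute your own.
        System.out.println(deleteBody("isBot:true"));
    }
}
```

The body would be posted to the statistics core's update handler, followed by a commit.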
>
> --Hardy
>
> --
> *From:* Anthony Petryk [anthony...@uottawa.ca ]
> *Sent:* Thursday, May 14, 2015 12:06 PM
> *To:* Monika C. Mevenkamp; dspac...@lists.sourceforge.net
> *Subject:* Re: [Dspace-tech] spider ip recognition
>
> Hi again,
>
>
>
> Unfortunately, the documentation for the stats-util command is incorrect.
> Specifically this line:
>
>
>
> *-i or --delete-spiders-by-ip: Delete Spiders in Solr By IP Address, DNS
> name, or Agent name. Will prune out all records that match spider
> identification patterns.*
>
>
>
> Running “stats-util -i” does not actually remove spiders by DNS name or
> Agent name. Here are the relevant sections of the code, from
> StatisticsClient.java and SolrLogger.java:
>
>
>
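> (…)
> else if (line.hasOption('i'))
> {
>     // The -i option only triggers deletion by IP address
>     SolrLogger.deleteRobotsByIP();
> }
>
> public static void deleteRobotsByIP()
> {
>     // Only the known spider IPs are iterated; DNS names and agents are never consulted
>     for (String ip : SpiderDetector.getSpiderIpAddresses()) {
>         deleteIP(ip);
>     }
> }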
>
>
>
> What this means is that, if a spider is in your Solr stats, there’s no way
> to remove it other than manually adding its IP to [dspace]/config/spiders;
> adding its DNS name or Agent name to the configs will not expunge it.
> Updating the spider files with “stats-util -u” does little to help because
> the IP lists it pulls from are out of date.
>
>
>
> An example is the spider from the Bing search engine: bingbot. As of
> DSpace 4.3, it’s not in the list of spiders by DNS name or Agent name, nor
> is it in the list of spider IP addresses. So anyone running DSpace 4.3
> likely has usage stats inflated by visits from this spider. The only way
> to remove it is to specify all the IPs for bingbot. Multiply that by all
> the other “new” spiders and we’re talking about a lot of work.
>
>
>
> I tried briefly to modify the code to take domains/agents into account
> when marking or deleting spiders, but I wasn’t able to figure out how to
> query Solr with regex patterns. It’s easier to do with IPs because each IP
> or IP range is transformed into a String and used as a standard query
> parameter.
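
One possible way around the regex problem, sketched under assumptions: instead of asking Solr to evaluate the patterns, fetch the distinct agent strings (for example with a facet on the agent field), match them against the patterns on the client side, and turn each match into an exact-value query. The field name "userAgent", the sample patterns, and the class and method names are assumptions for illustration; whether a quoted value matches exactly depends on how the field is defined in the schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of DSpace.
class AgentPatternMatcher {

    // Returns one exact-value Solr query per agent string that matches
    // at least one of the given regular expressions (case-insensitive).
    static List<String> deleteQueries(List<String> agents, List<String> regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
        }

        List<String> queries = new ArrayList<>();
        for (String agent : agents) {
            for (Pattern pattern : patterns) {
                if (pattern.matcher(agent).find()) {
                    // Escape backslashes and quotes so the value stays one quoted term.
                    String escaped = agent.replace("\\", "\\\\").replace("\"", "\\\"");
                    queries.add("userAgent:\"" + escaped + "\"");
                    break; // one query per agent is enough
                }
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> agents = List.of(
                "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
                "Mozilla/5.0 (Windows NT 10.0) Firefox/44.0");
        // Sample patterns only; real ones would come from the spider agent files.
        List<String> regexes = List.of("bingbot", "spider");
        deleteQueries(agents, regexes).forEach(System.out::println);
    }
}
```

Each resulting query is a plain string, so it could be passed along the same way the IP queries are.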
>
>
>
> Anthony
>
>
>
> *From:* Monika C. Mevenkamp [mailto:...@princeton.edu ]
> *Sent:* Thursday, May 14, 2015 11:17 AM
> *To:* Anthony Petryk
> *Cc:* Monika C. Mevenkamp; dspac...@lists.sourceforge.net
> *Subject:* Re: [Dspace-tech] spider ip recognition
>
>
>
> Anthony
>
>
>
> Since DSpace 4 you can filter by userAgent; see
> https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-FilteringandPruningSpiders
>
> I have not used this myself and am not sure whether these filters are
> applied as crawlers access content, or whether you need to run the
> [dspace]/bin/dspace stats-util command on a regular basis. You definitely
> need to run it to prune or mark usage events after you configure a list
> of userAgents you want to filter against.
>
>
>
> Monika
>
>
>
>
>
> Monika Mevenkamp
>
> phone: 609-258-4161
>
> Princeton University, Princeton, NJ 08544
>
>
>
> On May 12, 2015, at 2:13 PM, Anthony Petryk wrote:
>
>
>
> After a bit of investigation, it turns out that a significant portion of
> our items stats come from spiders. Any thoughts on the best way to go
> about removing them from Solr retroactively? There’s nothing that I can
> see in the code that will do this by domai