Re: [Dspace-tech] spider ip recognition
Hi, you've run into a known issue, and one I very recently wrestled with myself: https://jira.duraspace.org/browse/DS-2431 See my last comment on that ticket; I found a way around the issue by simply deleting the spider docs from the stats index via a query in the Solr admin interface.

--Hardy

From: Anthony Petryk [anthony.pet...@uottawa.ca]
Sent: Thursday, May 14, 2015 12:06 PM
To: Monika C. Mevenkamp; dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] spider ip recognition
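Hardy's workaround (deleting the spider docs from the stats index with a delete-by-query) can also be scripted rather than typed into the Solr admin UI. A minimal sketch, assuming a stock statistics core; the host, port, and user-agent patterns below are illustrative placeholders, not values from this thread:

```python
from urllib import request

def spider_delete_xml(agent_patterns):
    """Build a Solr delete-by-query body that matches documents whose
    userAgent field matches any of the given wildcard patterns."""
    clauses = " OR ".join("userAgent:%s" % p for p in agent_patterns)
    return "<delete><query>%s</query></delete>" % clauses

def post_delete(update_url, body):
    """POST the delete to Solr's update handler, committing in the same call."""
    req = request.Request(
        update_url + "?commit=true",
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
    )
    return request.urlopen(req)  # network call; point this at your own Solr

# Illustrative patterns only; audit the matches with a plain query first.
body = spider_delete_xml(["*bingbot*", "*Googlebot*"])
# post_delete("http://localhost:8983/solr/statistics/update", body)
```

Running the same expression as an ordinary `q=` query with `rows=0` first shows how many documents the delete would remove, which is a cheap sanity check before committing anything destructive.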
Re: [Dspace-tech] spider ip recognition
Thanks for the info Hardy. I just discovered that Jira issue yesterday. I'll probably use your approach for our own stats, but I'm sure other sites would benefit from domain/agent handling when running "stats-util -m" or "stats-util -i" (as described in the issue).

Best, Anthony

From: Pottinger, Hardy J. [mailto:pottinge...@missouri.edu]
Sent: Friday, May 15, 2015 10:19 AM
To: Anthony Petryk; Monika C. Mevenkamp; dspace-tech@lists.sourceforge.net
Subject: RE: [Dspace-tech] spider ip recognition

Hi, you've run into a known issue, and one I very recently wrestled with myself: https://jira.duraspace.org/browse/DS-2431 See my last comment on that ticket; I found a way around the issue by simply deleting the spider docs from the stats index via a query in the Solr admin interface.

--Hardy
Re: [Dspace-tech] spider ip recognition
Hi again,

Unfortunately, the documentation for the stats-util command is incorrect. Specifically this line:

-i or --delete-spiders-by-ip: Delete Spiders in Solr By IP Address, DNS name, or Agent name. Will prune out all records that match spider identification patterns.

Running "stats-util -i" does not actually remove spiders by DNS name or Agent name. Here are the relevant sections of the code, from StatisticsClient.java and SolrLogger.java:

    (...)
    else if (line.hasOption('i')) {
        SolrLogger.deleteRobotsByIP();
    }

    public static void deleteRobotsByIP() {
        for (String ip : SpiderDetector.getSpiderIpAddresses()) {
            deleteIP(ip);
        }
    }

What this means is that, if a spider is in your Solr stats, there's no way to remove it other than manually adding its IP to [dspace]/config/spiders; adding its DNS name or Agent name to the configs will not expunge it.

Updating the spider files with "stats-util -u" does little to help because the IP lists it pulls from are out of date. An example is the spider from the Bing search engine: bingbot. As of DSpace 4.3, it's not in the list of spiders by DNS name or Agent name, nor is it in the list of spider IP addresses. So anyone running DSpace 4.3 likely has usage stats inflated by visits from this spider. The only way to remove it is to specify all the IPs for bingbot. Multiply that by all the other "new" spiders and we're talking about a lot of work.

I tried briefly to modify the code to take domains/agents into account when marking or deleting spiders, but I wasn't able to figure out how to query Solr with regex patterns. It's easier to do with IPs because each IP or IP range is transformed into a String and used as a standard query parameter.

Anthony

From: Monika C. Mevenkamp [mailto:moni...@princeton.edu]
Sent: Thursday, May 14, 2015 11:17 AM
To: Anthony Petryk
Cc: Monika C. Mevenkamp; dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] spider ip recognition
Re: [Dspace-tech] spider ip recognition
Anthony,

Since dspace 4 you can filter by userAgent, see https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-FilteringandPruningSpiders

I have not used this myself and am not sure whether these filters are applied as crawlers access content, or whether you need to run the [dspace]/bin/dspace stats-util command on a regular basis. You definitely need to run it to prune or mark usage events after you configure a list of userAgents you want to filter against.

Monika

Monika Mevenkamp
phone: 609-258-4161
Princeton University, Princeton, NJ 08544
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y

___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
Re: [Dspace-tech] spider ip recognition
After a bit of investigation, it turns out that a significant portion of our items stats come from spiders. Any thoughts on the best way to go about removing them from Solr retroactively? There's nothing that I can see in the code that will do this by domain or agent, only IP. We're not excited at the prospect of pulling out the IPs of all the spiders in order to run "stats-util -i" effectively.

Cheers, Anthony

From: Monika C. Mevenkamp [mailto:moni...@princeton.edu]
Sent: Friday, May 08, 2015 9:59 AM
To: Anthony Petryk
Cc: dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] spider ip recognition
Re: [Dspace-tech] spider ip recognition
Thanks very much Monika! I'll try it out.

Cheers, Anthony

From: Monika C. Mevenkamp [mailto:moni...@princeton.edu]
Sent: Friday, May 08, 2015 9:59 AM
To: Anthony Petryk
Cc: dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] spider ip recognition
Re: [Dspace-tech] spider ip recognition
Anthony,

I wrote a small ruby script to put solr queries together when I was poking around my stats; see https://github.com/akinom/dscriptor/blob/master/solr/solr_query.rb. An example parameter file is https://github.com/akinom/dscriptor/blob/master/solr/solr_query.yml. Run it as "ruby solr/solr_query.rb"; of course you need to adjust the parameters in the yml file.

You can query like this:

http://localhost:YOUR-PORT/solr/statistics/select?wt=json&indent=true&rows=1&facet=true&facet.field=ip&facet.mincount=1&q=type:2+id:218+isBot:false

This excludes records that are marked as bots (isBot:false), restricts to type:2 (aka items) and id:218 (aka the item with id 218), returns one item, and facets on ip addresses. Crank up the number of rows to get more matching docs.

Monika

Monika Mevenkamp
phone: 609-258-4161
Princeton University, Princeton, NJ 08544
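Monika's example URL can also be assembled programmatically, which makes it easy to vary the item id or raise the row count. A small sketch (the host and port are site-specific placeholders; the field names `type`, `id`, and `isBot` are the ones her query uses):

```python
from urllib.parse import urlencode

def item_stats_query(item_id, rows=1):
    """Query string for one item's usage docs (type:2 = items),
    excluding docs already marked as bots, faceted by client ip."""
    return urlencode({
        "q": "type:2 AND id:%d AND isBot:false" % item_id,
        "wt": "json",
        "indent": "true",
        "rows": rows,          # raise this to pull more matching docs
        "facet": "true",
        "facet.field": "ip",
        "facet.mincount": 1,
    })

qs = item_stats_query(218)
# Full URL (port is site-specific):
#   http://localhost:YOUR-PORT/solr/statistics/select? followed by qs
```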
Re: [Dspace-tech] spider ip recognition
We're having a similar problem (DSpace 4.3). Here's an example: http://www.ruor.uottawa.ca/handle/10393/23938/statistics. These stats seem high for this item (although we could be wrong about that), but what's more peculiar is all the traffic from China. We've run through all the options for the "stats-util" command but the numbers remain the same. Note that when we run "stats-util -u" (update the IP lists), we always get the message "Not modified - so not downloaded". I'm guessing that the spider lists by domain name or user-agent are more current?

Anyway, we want to determine whether these stats are bona fide or whether there's something wrong with the spider detection. From the documentation it seems we have to query Solr directly to do this. Not being an expert in Solr, I'm hoping someone on this list could provide the query that retrieves *all the stats for a given item* (i.e. what's listed under "Common stored fields for all usage events" in the documentation).

Thanks for your time,
Anthony

Anthony Petryk
Emerging Technologies Librarian | Bibliothécaire des technologies émergentes
uOttawa Library | Bibliothèque uOttawa
613-562-5800 x4650
apet...@uottawa.ca

-----Original Message-----
From: Mark H. Wood [mailto:mw...@iupui.edu]
Sent: Friday, April 24, 2015 9:42 AM
To: dspace-tech@lists.sourceforge.net
Subject: Re: [Dspace-tech] spider ip recognition
Re: [Dspace-tech] spider ip recognition
On Thu, Apr 23, 2015 at 05:39:01PM, Monika C. Mevenkamp wrote:
> I found a couple of really suspicious numbers in my solr stats, aka lots of
> entries were marked as isBot=false although they probably should have been
> isBot=true.
>
> In the config file I use
>
> spiderips.urls = http://iplists.com/google.txt, \
> http://iplists.com/inktomi.txt, \
> http://iplists.com/lycos.txt, \
> http://iplists.com/infoseek.txt, \
> http://iplists.com/altavista.txt, \
> http://iplists.com/excite.txt, \
> http://iplists.com/northernlight.txt, \
> http://iplists.com/misc.txt, \
> http://iplists.com/non_engines.txt
>
> I could not find downloadable lists for Bing, Baidu, Yahoo.
> The best I saw was:
> http://myip.ms/info/bots/Google_Bing_Yahoo_Facebook_etc_Bot_IP_Addresses.html
> Is that reliable?
>
> Does anybody out there have lists / sources that they can share?

What version of DSpace are you running? Recent versions can also recognize spiders by regular expression matching of the domain name or UserAgent: string. (However, that only works for new entries. I've recently found that some of the tools for loading and grooming the stats core don't use SpiderDetector and are oblivious of these newer patterns.)

> Also: does the dspace code gracefully deal with IP address patterns?

That depends on what is considered graceful. The code (in org.dspace.statistics.util.IPTable) accepts patterns in three forms:

11.22.33.44-11.22.33.55
11.22.33.44
11.22.33

Addresses in the first form may be suffixed with a CIDR mask-length, but it is currently ignored. If I've understood the code, a range (the first form) is assumed to differ only in the fourth octet. It will match all addresses between "44" and "55" within the /24 containing the start of the range. The second form is an exact match of a single address. The third form is a match of the first 24 bits -- an entire Class C subnet. There is no provision for IPv6.

--
Mark H. Wood
Lead Technology Analyst
University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu
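Mark's description of the three pattern forms IPTable accepts can be paraphrased in a few lines. This is a re-implementation of the matching rules exactly as he describes them (a last-octet range within one /24, an exact address, and a first-three-octets prefix), not DSpace's actual code:

```python
def ip_matches(pattern, addr):
    """True if IPv4 addr matches one of the three forms Mark describes:
    'a.b.c.d-a.b.c.e' (range), 'a.b.c.d' (exact), or 'a.b.c' (/24)."""
    octets = addr.split(".")
    if "-" in pattern:
        # Range form: assumed to differ only in the fourth octet,
        # so it only matches inside the /24 of the range's start.
        start, end = pattern.split("-")
        s, e = start.split("."), end.split(".")
        return (octets[:3] == s[:3]
                and int(s[3]) <= int(octets[3]) <= int(e[3]))
    parts = pattern.split(".")
    if len(parts) == 4:
        return addr == pattern          # exact single address
    return octets[:3] == parts          # first 24 bits: a Class C subnet

print(ip_matches("11.22.33.44-11.22.33.55", "11.22.33.50"))  # True
print(ip_matches("11.22.33", "11.22.33.200"))                # True
print(ip_matches("11.22.33.44", "11.22.34.44"))              # False
```

Note how the range form silently fails to match anything outside the /24 of its start, which mirrors the limitation Mark points out; a pattern spanning multiple /24s would need to be split into prefix entries.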
[Dspace-tech] spider ip recognition
I found a couple of really suspicious numbers in my solr stats, aka lots of entries were marked as isBot=false although they probably should have been isBot=true.

In the config file I use

spiderips.urls = http://iplists.com/google.txt, \
http://iplists.com/inktomi.txt, \
http://iplists.com/lycos.txt, \
http://iplists.com/infoseek.txt, \
http://iplists.com/altavista.txt, \
http://iplists.com/excite.txt, \
http://iplists.com/northernlight.txt, \
http://iplists.com/misc.txt, \
http://iplists.com/non_engines.txt

I could not find downloadable lists for Bing, Baidu, Yahoo. The best I saw was: http://myip.ms/info/bots/Google_Bing_Yahoo_Facebook_etc_Bot_IP_Addresses.html Is that reliable?

Does anybody out there have lists / sources that they can share?

Also: does the dspace code gracefully deal with IP address patterns?

Monika

Monika Mevenkamp
phone: 609-258-4161
Princeton University, Princeton, NJ 08544