Thanks, Alan!

Total number of bot hits purged: 575004

One thing I found curious: I first ran it with -pno -d, then with -pyes,
and the totals differed between the two runs:

dspace@ir:/home/dspace$ scripts/check-spider-hits.sh -u http://localhost:8080/solr -f /dspacecris-dut/config/spiders/agents/example -pno -d
(DEBUG) Using spiders pattern file: /dspacecris-dut/config/spiders/agents/example
(DEBUG) Checking for hits from spider: AllenTrack
(DEBUG) Checking for hits from spider: Arachmo
(DEBUG) Checking for hits from spider: ContentSmartz
(DEBUG) Checking for hits from spider: DSurf
(DEBUG) Checking for hits from spider: EmailSiphon
(DEBUG) Checking for hits from spider: EmailWolf
(DEBUG) Checking for hits from spider: GetRight
(DEBUG) Checking for hits from spider: Googlebot
Found 325498 hits from Googlebot in statistics
(DEBUG) Checking for hits from spider: HTTrack
Found 1366 hits from HTTrack in statistics
(DEBUG) Checking for hits from spider: LOCKSS
(DEBUG) Checking for hits from spider: MSNBot
(DEBUG) Checking for hits from spider: Milbot
(DEBUG) Checking for hits from spider: MuscatFerre
(DEBUG) Checking for hits from spider: NABOT
(DEBUG) Checking for hits from spider: NaverBot
(DEBUG) Checking for hits from spider: OurBrowser
(DEBUG) Checking for hits from spider: Readpaper
(DEBUG) Checking for hits from spider: Strider
Found 1 hits from Strider in statistics
(DEBUG) Checking for hits from spider: Teoma
Found 2 hits from Teoma in statistics
(DEBUG) Checking for hits from spider: Wanadoo
Found 7 hits from Wanadoo in statistics
(DEBUG) Checking for hits from spider: WebCloner
(DEBUG) Checking for hits from spider: WebCopier
(DEBUG) Checking for hits from spider: WebReaper
(DEBUG) Checking for hits from spider: WebStripper
(DEBUG) Checking for hits from spider: WebZIP
(DEBUG) Checking for hits from spider: Webinator
(DEBUG) Checking for hits from spider: Webmetrics
(DEBUG) Checking for hits from spider: Wget
Found 170 hits from Wget in statistics
(DEBUG) Checking for hits from spider: alexa
Found 238 hits from alexa in statistics
(DEBUG) Checking for hits from spider: almaden
(DEBUG) Checking for hits from spider: appie
(DEBUG) Checking for hits from spider: architext
(DEBUG) Checking for hits from spider: arks
Found 18 hits from arks in statistics
(DEBUG) Checking for hits from spider: asterias
(DEBUG) Checking for hits from spider: atomz
(DEBUG) Checking for hits from spider: autoemailspider
(DEBUG) Checking for hits from spider: awbot
(DEBUG) Checking for hits from spider: baiduspider
(DEBUG) Checking for hits from spider: bbot
(DEBUG) Checking for hits from spider: biadu
(DEBUG) Checking for hits from spider: biglotron
(DEBUG) Checking for hits from spider: bjaaland
(DEBUG) Checking for hits from spider: bloglines
(DEBUG) Checking for hits from spider: blogpulse
(DEBUG) Checking for hits from spider: bot
Found 520514 hits from bot in statistics
(DEBUG) Checking for hits from spider: bspider
Found 72 hits from bspider in statistics
(DEBUG) Checking for hits from spider: bwh3_user_agent
(DEBUG) Checking for hits from spider: celestial
(DEBUG) Checking for hits from spider: cfnetwork|checkbot
(DEBUG) Solr query returned HTTP 400, skipping cfnetwork|checkbot.
(DEBUG) Checking for hits from spider: combine
(DEBUG) Checking for hits from spider: contentmatch
(DEBUG) Checking for hits from spider: core
(DEBUG) Checking for hits from spider: crawl
Found 15205 hits from crawl in statistics
(DEBUG) Checking for hits from spider: crawler
Found 15191 hits from crawler in statistics
(DEBUG) Checking for hits from spider: cursor
(DEBUG) Checking for hits from spider: custo
Found 4 hits from custo in statistics
(DEBUG) Checking for hits from spider: daumoa
(DEBUG) Checking for hits from spider: docomo
(DEBUG) Checking for hits from spider: dtSearchSpider
(DEBUG) Checking for hits from spider: dumbot
(DEBUG) Checking for hits from spider: easydl
(DEBUG) Checking for hits from spider: exabot
Found 133 hits from exabot in statistics
(DEBUG) Checking for hits from spider: fast-webcrawler
(DEBUG) Checking for hits from spider: favorg
(DEBUG) Checking for hits from spider: feedburner
(DEBUG) Checking for hits from spider: ferret
(DEBUG) Checking for hits from spider: findlinks
Found 10626 hits from findlinks in statistics
(DEBUG) Checking for hits from spider: gaisbot
(DEBUG) Checking for hits from spider: geturl
(DEBUG) Checking for hits from spider: gigabot
(DEBUG) Checking for hits from spider: girafabot
(DEBUG) Checking for hits from spider: gnodspider
(DEBUG) Checking for hits from spider: google
Found 327642 hits from google in statistics
(DEBUG) Checking for hits from spider: grub
(DEBUG) Checking for hits from spider: gulliver
(DEBUG) Checking for hits from spider: harvest
(DEBUG) Checking for hits from spider: heritrix
Found 765 hits from heritrix in statistics
(DEBUG) Checking for hits from spider: hl_ftien_spider
(DEBUG) Checking for hits from spider: holmes
(DEBUG) Checking for hits from spider: htdig
(DEBUG) Checking for hits from spider: htmlparser
(DEBUG) Checking for hits from spider: httrack
(DEBUG) Checking for hits from spider: iSiloX
(DEBUG) Checking for hits from spider: ia_archiver
Found 243 hits from ia_archiver in statistics
(DEBUG) Checking for hits from spider: ichiro
Found 1153 hits from ichiro in statistics
(DEBUG) Checking for hits from spider: iktomi
(DEBUG) Checking for hits from spider: ilse
(DEBUG) Checking for hits from spider: internetseer
(DEBUG) Checking for hits from spider: intute
(DEBUG) Checking for hits from spider: java
Found 2 hits from java in statistics
(DEBUG) Checking for hits from spider: jeeves
(DEBUG) Checking for hits from spider: jobo
(DEBUG) Checking for hits from spider: kyluka
(DEBUG) Checking for hits from spider: larbin
(DEBUG) Checking for hits from spider: libwww
Found 113 hits from libwww in statistics
(DEBUG) Checking for hits from spider: lilina
(DEBUG) Checking for hits from spider: linkbot
(DEBUG) Checking for hits from spider: linkcheck
(DEBUG) Checking for hits from spider: linkchecker
(DEBUG) Checking for hits from spider: linkscan
(DEBUG) Checking for hits from spider: linkwalker
(DEBUG) Checking for hits from spider: lmspider
(DEBUG) Checking for hits from spider: lwp
(DEBUG) Checking for hits from spider: megite
(DEBUG) Checking for hits from spider: milbot
(DEBUG) Checking for hits from spider: mimas
(DEBUG) Checking for hits from spider: mj12bot
(DEBUG) Checking for hits from spider: mnogosearch
(DEBUG) Checking for hits from spider: moget
(DEBUG) Checking for hits from spider: mojeekbot
(DEBUG) Checking for hits from spider: momspider
(DEBUG) Checking for hits from spider: motor
Found 8 hits from motor in statistics
(DEBUG) Checking for hits from spider: msiecrawler
(DEBUG) Checking for hits from spider: msnbot
Found 8993 hits from msnbot in statistics
(DEBUG) Checking for hits from spider: myweb
(DEBUG) Checking for hits from spider: nagios
(DEBUG) Checking for hits from spider: netcraft
(DEBUG) Checking for hits from spider: netluchs
(DEBUG) Checking for hits from spider: no_user_agent
(DEBUG) Checking for hits from spider: nomad
(DEBUG) Checking for hits from spider: nutch
Found 68 hits from nutch in statistics
(DEBUG) Checking for hits from spider: ocelli
(DEBUG) Checking for hits from spider: onetszukaj
(DEBUG) Checking for hits from spider: perman
(DEBUG) Checking for hits from spider: pioneer
(DEBUG) Checking for hits from spider: powermarks
(DEBUG) Checking for hits from spider: psbot
Found 3 hits from psbot in statistics
(DEBUG) Checking for hits from spider: python
Found 1 hits from python in statistics
(DEBUG) Checking for hits from spider: qihoobot
(DEBUG) Checking for hits from spider: rambler
(DEBUG) Checking for hits from spider: redalert|robozilla
(DEBUG) Solr query returned HTTP 400, skipping redalert|robozilla.
(DEBUG) Checking for hits from spider: robot
Found 56183 hits from robot in statistics
(DEBUG) Checking for hits from spider: robots
Found 43145 hits from robots in statistics
(DEBUG) Checking for hits from spider: rss
(DEBUG) Checking for hits from spider: scan4mail
(DEBUG) Checking for hits from spider: scientificcommons
(DEBUG) Checking for hits from spider: scirus
(DEBUG) Checking for hits from spider: scooter
(DEBUG) Checking for hits from spider: seekbot
(DEBUG) Checking for hits from spider: seznambot
(DEBUG) Checking for hits from spider: shoutcast
(DEBUG) Checking for hits from spider: slurp
Found 104 hits from slurp in statistics
(DEBUG) Checking for hits from spider: sogou
Found 2178 hits from sogou in statistics
(DEBUG) Checking for hits from spider: speedy
Found 139 hits from speedy in statistics
(DEBUG) Checking for hits from spider: spider
Found 23341 hits from spider in statistics
(DEBUG) Checking for hits from spider: spiderman
(DEBUG) Checking for hits from spider: spiderview
(DEBUG) Checking for hits from spider: sunrise
(DEBUG) Checking for hits from spider: superbot
(DEBUG) Checking for hits from spider: surveybot
(DEBUG) Checking for hits from spider: tailrank
(DEBUG) Checking for hits from spider: technoratibot
(DEBUG) Checking for hits from spider: titan
(DEBUG) Checking for hits from spider: turnitinbot
(DEBUG) Checking for hits from spider: twiceler
(DEBUG) Checking for hits from spider: ucsd
(DEBUG) Checking for hits from spider: ultraseek
(DEBUG) Checking for hits from spider: urlaliasbuilder
(DEBUG) Checking for hits from spider: urllib
Found 66 hits from urllib in statistics
(DEBUG) Checking for hits from spider: voila
(DEBUG) Checking for hits from spider: webcollage
(DEBUG) Checking for hits from spider: weblayers
(DEBUG) Checking for hits from spider: webmirror
(DEBUG) Checking for hits from spider: webreaper
(DEBUG) Checking for hits from spider: wordpress
(DEBUG) Checking for hits from spider: worm
(DEBUG) Checking for hits from spider: xenu
(DEBUG) Checking for hits from spider: yacy
Found 2 hits from yacy in statistics
(DEBUG) Checking for hits from spider: yahoo
Found 153 hits from yahoo in statistics
(DEBUG) Checking for hits from spider: yahoofeedseeker
(DEBUG) Checking for hits from spider: yahooseeker
(DEBUG) Checking for hits from spider: yandex
Found 8591 hits from yandex in statistics
(DEBUG) Checking for hits from spider: yodaobot
(DEBUG) Checking for hits from spider: zealbot
(DEBUG) Checking for hits from spider: zeus
(DEBUG) Checking for hits from spider: zyborg
(DEBUG) Checking for hits from spider: parsijoo
Found 38 hits from parsijoo in statistics
(DEBUG) Checking for hits from spider: validator

Total number of hits from bots: 1361976
dspace@ir:/home/dspace$ scripts/check-spider-hits.sh -u http://localhost:8080/solr -f /dspacecris-dut/config/spiders/agents/example -pyes
Purging 325498 hits from Googlebot in statistics
Purging 1366 hits from HTTrack in statistics
Purging 1 hits from Strider in statistics
Purging 2 hits from Teoma in statistics
Purging 7 hits from Wanadoo in statistics
Purging 170 hits from Wget in statistics
Purging 238 hits from alexa in statistics
Purging 18 hits from arks in statistics
Purging 195014 hits from bot in statistics
Purging 72 hits from bspider in statistics
Purging 14714 hits from crawl in statistics
Purging 4 hits from custo in statistics
Purging 10626 hits from findlinks in statistics
Purging 2271 hits from google in statistics
Purging 765 hits from heritrix in statistics
Purging 5 hits from ia_archiver in statistics
Purging 598 hits from ichiro in statistics
Purging 2 hits from java in statistics
Purging 113 hits from libwww in statistics
Purging 8 hits from motor in statistics
Purging 1 hits from python in statistics
Purging 103 hits from slurp in statistics
Purging 2178 hits from sogou in statistics
Purging 139 hits from speedy in statistics
Purging 20938 hits from spider in statistics
Purging 66 hits from urllib in statistics
Purging 49 hits from yahoo in statistics
Purging 38 hits from parsijoo in statistics

Total number of bot hits purged: 575004
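
A possible explanation, offered as a guess from the numbers above rather
than from reading the script: the patterns overlap. Anything matching
Googlebot also matches bot, so the dry run counts those hits twice, while
the purge run has already deleted them by the time it reaches bot. The
figures fit: 520514 - 325498 = 195016, within two of the 195014 purged for
bot. The sketch below reproduces the effect with grep on four made-up user
agents (the sample data and variable names are hypothetical, and substring
matching with grep only stands in for the Solr query):

```shell
#!/usr/bin/env bash
# Hypothetical sample of user agents standing in for the Solr statistics core.
hits=$'Googlebot/2.1\nGooglebot/2.1\nbingbot/2.0\nMozilla/5.0 Firefox'
patterns=(Googlebot bot)

# Dry run: every pattern is counted against the full, unchanged data set,
# so a record matching two patterns is counted twice.
dry_run_total=0
for p in "${patterns[@]}"; do
    n=$(grep -c -- "$p" <<< "$hits")
    dry_run_total=$((dry_run_total + n))
done

# Purge: matching records are removed before the next pattern is checked,
# so each record is counted at most once.
purged_total=0
remaining=$hits
for p in "${patterns[@]}"; do
    n=$(grep -c -- "$p" <<< "$remaining")
    purged_total=$((purged_total + n))
    remaining=$(grep -v -- "$p" <<< "$remaining")
done

echo "dry run: $dry_run_total, purged: $purged_total"
```

With this sample it prints "dry run: 5, purged: 3", so the purge total is
probably the one closer to the number of distinct bot hits.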


On Sun, 10 Nov 2019 at 18:12, Alan Orth <alan.o...@gmail.com> wrote:

> Dear list,
>
> I ended up writing a little bash script¹ to read known spider user agents
> from a file such as DSpace's `example` pattern file and check for matching
> documents in the Solr statistics core (or yearly statistics shards). It can
> optionally purge the matched records, but this is disabled by default. In
> our case, I purged 2 MILLION hits from our statistics core, which has data
> going back nine years. It feels nice to know that our usage statistics are
> more accurate now, though the repository managers will be depressed because
> their content wasn't as popular as they thought. :)
>
> To use the script you need to be able to access your DSpace's Solr
> instance directly, either by running the script on the same machine or by
> making the port available via an SSH tunnel:
>
> $ ssh -L 8080:localhost:8080 dspace.example.edu
>
> Then you can run the script, specifying the location of the Solr instance
> and the location of the patterns file:
>
> $ ./check-spider-hits.sh -u http://localhost:8080/solr -f ~/dspace/config/spiders/agents/example
>
> Read the script source or check its help text with `-h` to see more
> options. There is one implementation detail that is interesting: DSpace
> uses the spider agents file from the COUNTER-Robots project², which
> contains some plaintext names as well as regular expressions. Unfortunately,
> Solr 4.x, as used in current DSpace 5 and 6, only has basic support for
> regular expressions. For example, all patterns are anchored with ^ and $ by
> default, and you need to use [0-9] instead of \d. As such, my script does
> some basic filtering of the input pattern file to remove user agents that
> are using regular expression characters. I imagine this is part of the
> reason why DSpace's mark spider feature was never completed for user
> agents, because the example agents file used by SpiderDetector.java cannot
> be used when searching Solr later for marking spiders.
>
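
To illustrate the filtering described above, here is a minimal sketch, not
the script's actual code: it keeps only the lines that contain no regular
expression metacharacters. The function name filter_plain_patterns and the
sample patterns are hypothetical, and the set of characters treated as
special is an assumption. It includes |, since the run earlier in this
thread shows Solr returning HTTP 400 for cfnetwork|checkbot.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read patterns on stdin, one per line, and print only
# those that are plain names (no regular expression metacharacters).
filter_plain_patterns() {
    # Inside a bracket expression every listed character is literal;
    # "]" must come first to be treated as a member of the set.
    grep -v -E '[][\^$.|?*+(){}]'
}

# Made-up sample input mixing plain names and regular expressions.
patterns=$'Googlebot\n^Java\\/[0-9]\ncfnetwork|checkbot\nfast-webcrawler\nbot'

plain=$(filter_plain_patterns <<< "$patterns")
echo "$plain"
```

Only Googlebot, fast-webcrawler and bot survive the filter. The cost of
this approach is that agents listed only as regular expressions are never
checked at all.
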
> I hope this is helpful for someone. Thanks to the contributors of the
> COUNTER-Robots project for curating this list.
>
> Regards,
>
> ¹ https://github.com/ilri/DSpace/blob/5_x-prod/check-spider-hits.sh
> ² https://github.com/atmire/COUNTER-Robots
>
> On Thu, Nov 7, 2019 at 3:55 PM Alan Orth <alan.o...@gmail.com> wrote:
>
>> Thank you, Mark. For now I'll just settle for an updated list of spider
>> agents from COUNTER-Robots¹ (dropping the text file into
>> dspace/config/spiders/agents seems to work).
>>
>> Regards,
>>
>> ¹ https://github.com/atmire/COUNTER-Robots
>>
>> On Tue, Nov 5, 2019 at 4:02 PM Mark H. Wood <mwoodiu...@gmail.com> wrote:
>>
>>> On Mon, Nov 04, 2019 at 11:10:25PM +0200, Alan Orth wrote:
>>> > The DSpace 5.x (and presumably 6.x) documentation[0] suggests that it is
>>> > possible to mark existing Solr statistics records as being bots or spiders
>>> > using the following command:
>>> >
>>> > $ dspace stats-util -m
>>> >
>>> > After trying to test this with an updated list of user agents[1] for a
>>> > while I realized that the feature is only implemented for IPs. As it stands
>>> > right now the code in StatisticsClient.java only marks robots based on
>>> > their IPs, but not on their user agents or domains:
>>> >
>>> > else if (line.hasOption('m'))
>>> > {
>>> >     SolrLogger.markRobotsByIP();
>>> > }
>>> >
>>> > Strangely enough, SolrLogger has a markRobotByUserAgent() function that is
>>> > never called anywhere in the Java code base (also it seems to only be
>>> > partially implemented, as it does not iterate over agents).
>>> >
>>> > Should I file a bug? This issue affects DSpace 5.x and 6.x for sure.
>>>
>>> https://jira.duraspace.org/browse/DS-2431
>>>
>>> There are several Issues related to completing the work on extended
>>> spider marking and filtering.
>>>
>>> --
>>> Mark H. Wood
>>> Lead Technology Analyst
>>>
>>> University Library
>>> Indiana University - Purdue University Indianapolis
>>> 755 W. Michigan Street
>>> Indianapolis, IN 46202
>>> 317-274-0749
>>> www.ulib.iupui.edu
>>>
>>> --
>>> All messages to this mailing list should adhere to the DuraSpace Code of
>>> Conduct: https://duraspace.org/about/policies/code-of-conduct/
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "DSpace Technical Support" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to dspace-tech+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/dspace-tech/20191105140039.GA30402%40IUPUI.Edu
>>> .
>>>
>>
>>
>> --
>> Alan Orth
>> alan.o...@gmail.com
>> https://picturingjordan.com
>> https://englishbulgaria.net
>> https://mjanja.ch
>> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>>
>
>
> --
> Alan Orth
> alan.o...@gmail.com
> https://picturingjordan.com
> https://englishbulgaria.net
> https://mjanja.ch
> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>