Re: [Dspace-tech] spider ip recognition

2015-05-15 Thread Anthony Petryk
Thanks for the info Hardy.  I just discovered that Jira issue yesterday.  I'll 
probably use your approach for our own stats, but I'm sure other sites would 
benefit from domain/agent handling when running stats-util -m or stats-util 
-i (as described in the issue).

Best,

Anthony


Re: [Dspace-tech] spider ip recognition

2015-05-15 Thread Pottinger, Hardy J.
Hi, you've run into a known issue, and one I very recently wrestled with myself:

https://jira.duraspace.org/browse/DS-2431

See my last comment on that ticket: I found a way around the issue by simply 
deleting the spider docs from the stats index via a query in the Solr admin 
interface.

--Hardy



Re: [Dspace-tech] spider ip recognition

2015-05-14 Thread Anthony Petryk
Hi again,

Unfortunately, the documentation for the stats-util command is incorrect.  
Specifically this line:

-i or --delete-spiders-by-ip: Delete Spiders in Solr By IP Address, DNS name, 
or Agent name. Will prune out all records that match spider identification 
patterns.

Running "stats-util -i" does not actually remove spiders by DNS name or Agent 
name.  Here are the relevant sections of the code, from StatisticsClient.java 
and SolrLogger.java:

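// StatisticsClient.java
(…)
else if (line.hasOption('i'))
{
    SolrLogger.deleteRobotsByIP();
}

// SolrLogger.java: only the configured spider IP addresses are consulted
public static void deleteRobotsByIP()
{
    for (String ip : SpiderDetector.getSpiderIpAddresses()) {
        deleteIP(ip);
    }
}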

What this means is that, if a spider is in your Solr stats, there's no way to 
remove it other than manually adding its IP to [dspace]/config/spiders; adding 
its DNS name or Agent name to the configs will not expunge it.  Updating the 
spider files with "stats-util -u" does little to help because the IP lists it 
pulls from are out of date.

An example is the spider from the Bing search engine: bingbot.  As of DSpace 
4.3, it’s not in the list of spiders by DNS name or Agent name, nor is it in 
the list of spider IP addresses.  So anyone running DSpace 4.3 likely has usage 
stats inflated by visits from this spider.  The only way to remove it is to 
specify all the IPs for bingbot.  Multiply that by all the other “new” spiders 
and we’re talking about a lot of work.

I tried briefly to modify the code to take domains/agents into account when 
marking or deleting spiders, but I wasn’t able to figure out how to query Solr 
with regex patterns.  It’s easier to do with IPs because each IP or IP range is 
transformed into a String and used as a standard query parameter.
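
One possible route around the regex problem, offered as a sketch rather than a tested fix: the standard Lucene query parser in Solr 4.0 and later accepts a regular expression between forward slashes, for example userAgent:/.*bingbot.*/. The Java below only builds the XML body of a delete-by-query request; it does not send anything. The class and method names are hypothetical, and the field names userAgent and dns are assumptions about the statistics core's schema that should be checked against schema.xml. It also assumes those fields are untokenized strings, in which case the expression has to match the whole stored value (hence the leading and trailing .*), and matching is case-sensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: builds Solr delete-by-query bodies from spider name fragments. */
final class SpiderDeleteQueries {

    /** Characters that are special in Lucene regular expressions or in the query parser. */
    private static final String SPECIAL = "\\.?+*|{}[]()\"#@&<>~/";

    /** Backslash-escapes special characters so the fragment is matched literally. */
    static String escapeForRegex(String literal) {
        StringBuilder out = new StringBuilder();
        for (char c : literal.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    /** One clause such as userAgent:/.*bingbot.*/ (field name is an assumption). */
    static String regexClause(String field, String fragment) {
        return field + ":/.*" + escapeForRegex(fragment) + ".*/";
    }

    /** Joins one clause per fragment with OR and wraps the result in a delete request. */
    static String deleteBody(String field, List<String> fragments) {
        List<String> clauses = new ArrayList<>();
        for (String fragment : fragments) {
            clauses.add(regexClause(field, fragment));
        }
        return "<delete><query>" + xmlEscape(String.join(" OR ", clauses)) + "</query></delete>";
    }

    /** Minimal XML escaping for the query text. */
    static String xmlEscape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // The fragments here are examples only, not a vetted spider list.
        System.out.println(deleteBody("userAgent", Arrays.asList("bingbot", "Baiduspider")));
    }
}
```

The resulting body would be POSTed to the statistics core's update handler and followed by a commit. Running the same clause as a select query first, to see what it would match, seems prudent before deleting anything.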

Anthony


Re: [Dspace-tech] spider ip recognition

2015-05-14 Thread Monika C. Mevenkamp
Anthony

Since DSpace 4 you can filter by userAgent; see 
https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-FilteringandPruningSpiders
I have not used this myself and am not sure whether these filters are applied 
as crawlers access content, or whether you need to run the
[dspace]/bin/dspace stats-util command on a regular basis. You definitely need 
to run it to prune & mark usage events after you configure
a list of userAgents you want to filter against.

Monika


Monika Mevenkamp
phone: 609-258-4161
Princeton University, Princeton, NJ 08544


DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette

Re: [Dspace-tech] spider ip recognition

2015-05-12 Thread Anthony Petryk
After a bit of investigation, it turns out that a significant portion of our 
item stats come from spiders.  Any thoughts on the best way to go about 
removing them from Solr retroactively?  There's nothing that I can see in the 
code that will do this by domain or agent, only IP.  We're not excited at the 
prospect of pulling out the IPs of all the spiders in order to run stats-util -i 
effectively.

Cheers,

Anthony


Re: [Dspace-tech] spider ip recognition

2015-05-08 Thread Monika C. Mevenkamp
Anthony

I wrote a small ruby script to put solr queries together when I was poking 
around my stats

see https://github.com/akinom/dscriptor/blob/master/solr/solr_query.rb
an example parameter file is 
https://github.com/akinom/dscriptor/blob/master/solr/solr_query.yml

run it as: ruby solr/solr_query.rb

of course you need to adjust the parameters in the yml file

you can query like this

http://localhost:YOUR-PORT/solr/statistics/select?wt=json&indent=true&rows=1&facet=true&facet.field=ip&facet.mincount=1&q=type:2+id:218+isBot:false

exclude records that are marked as bots
do type:2 - aka items
do id:218 - aka item with id 218
return one item
facet on ip addresses

crank up the number of rows to get more matching docs
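
For anyone who would rather assemble that URL in code than by hand, here is a minimal Java sketch. The class and method names are made up for illustration, port 8080 is only a placeholder for YOUR-PORT, and URLEncoder.encode(String, Charset) needs Java 10 or later. The parameters are the ones from the query above; the encoder turns the spaces in q into + and the colons into %3A.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: assembles a Solr select URL for the statistics core. */
final class StatsQueryUrl {

    /** Joins URL-encoded key=value pairs with '&' and appends them to the base URL. */
    static String build(String baseUrl, Map<String, String> params) {
        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joined.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return baseUrl + "?" + joined;
    }

    /** Parameters for: non-bot events of one item, faceted on IP address. */
    static Map<String, String> itemByIp(int itemId, int rows) {
        Map<String, String> p = new LinkedHashMap<>(); // keeps insertion order
        p.put("wt", "json");
        p.put("indent", "true");
        p.put("rows", Integer.toString(rows));
        p.put("facet", "true");
        p.put("facet.field", "ip");
        p.put("facet.mincount", "1");
        p.put("q", "type:2 id:" + itemId + " isBot:false");
        return p;
    }

    public static void main(String[] args) {
        // 8080 is a placeholder; use whatever port your Solr instance listens on.
        System.out.println(build("http://localhost:8080/solr/statistics/select", itemByIp(218, 1)));
    }
}
```

Raising the rows argument returns more matching documents, as noted above.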

Monika




Monika Mevenkamp
phone: 609-258-4161
Princeton University, Princeton, NJ 08544



Re: [Dspace-tech] spider ip recognition

2015-05-08 Thread Anthony Petryk
Thanks very much Monika!  I'll try it out.

Cheers,

Anthony


Re: [Dspace-tech] spider ip recognition

2015-05-07 Thread Anthony Petryk
We're having a similar problem (DSpace 4.3).

Here's an example: http://www.ruor.uottawa.ca/handle/10393/23938/statistics.  
These stats seem high for this item (although we could be wrong about that), 
but what's more peculiar is all the traffic from China.  We've run through all 
the options for the stats-util command but the numbers remain the same.  Note 
that when we run stats-util -u (update the IP lists), we always get the 
message "Not modified - so not downloaded".  I'm guessing that the spider lists 
by domain name or user-agent are more current?  

Anyway, we want to determine whether these stats are bona fide or whether 
there's something wrong with the spider detection.  From the documentation it 
seems we have to query Solr directly to do this.  Not being an expert in Solr, 
I'm hoping someone on this list could provide the query that retrieves *all the 
stats for a given item* (i.e. what's listed under "Common stored fields for all 
usage events" in the documentation).

Thanks for your time,

Anthony

Anthony Petryk
Emerging Technologies Librarian | Bibliothécaire des technologies émergentes
uOttawa Library | Bibliothèque uOttawa
613-562-5800 x4650
apet...@uottawa.ca




Re: [Dspace-tech] spider ip recognition

2015-04-24 Thread Mark H. Wood
On Thu, Apr 23, 2015 at 05:39:01PM +, Monika C. Mevenkamp wrote:
> I found a couple of really suspicious numbers in my solr stats, aka lots of 
> entries were marked as isBot=false although they probably should have been 
> isBot=true.
> 
> In the config file I use
> 
> spiderips.urls = http://iplists.com/google.txt, \
>  http://iplists.com/inktomi.txt, \
>  http://iplists.com/lycos.txt, \
>  http://iplists.com/infoseek.txt, \
>  http://iplists.com/altavista.txt, \
>  http://iplists.com/excite.txt, \
>  http://iplists.com/northernlight.txt, \
>  http://iplists.com/misc.txt, \
>  http://iplists.com/non_engines.txt
> 
> 
> I could not find downloadable lists for Bing, Baidu, Yahoo.
> The best I saw was:   
> http://myip.ms/info/bots/Google_Bing_Yahoo_Facebook_etc_Bot_IP_Addresses.html
> Is that reliable?
> 
> Does anybody out there have lists / sources that they can share?

What version of DSpace are you running?  Recent versions can also
recognize spiders by regular expression matching of the domain name or
UserAgent: string.  (However, that only works for new entries.  I've
recently found that some of the tools for loading and grooming the
stat.s core don't use SpiderDetector and are oblivious of these newer
patterns.)

> Also: does the dspace code gracefully deal with IP address patterns?

That depends on what is considered graceful.  The code (in
org.dspace.statistics.util.IPTable) accepts patterns in three forms:

  11.22.33.44-11.22.33.55
  11.22.33.44
  11.22.33

Addresses in the first form may be suffixed with a CIDR mask-length,
but it is currently ignored.

If I've understood the code, a range (the first form) is assumed to
differ only in the fourth octet.  It will match all addresses between
44 and 55 within the /24 containing the start of the range.

The second form is an exact match of a single address.

The third form is a match of the first 24 bits -- an entire Class C
subnet.

There is no provision for IPv6.
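
To make the three forms concrete, here is a small Java sketch of the matching behaviour as described above. It is an illustration only, not the code in org.dspace.statistics.util.IPTable; the class and method names are made up, and the CIDR suffix is left out entirely.

```java
/** Illustrative matcher for the three IPv4 pattern forms described above. */
final class SpiderIpPattern {

    /**
     * Returns true if the dotted-quad address matches the pattern.
     * Malformed (non-numeric) input throws NumberFormatException.
     */
    static boolean matches(String pattern, String ip) {
        int[] addr = octets(ip);
        if (addr.length != 4) {
            return false;
        }
        String p = pattern.trim();
        int dash = p.indexOf('-');
        if (dash >= 0) {
            // Range form: only the fourth octet varies; the first three
            // octets are taken from the start of the range.
            int[] start = octets(p.substring(0, dash));
            int[] end = octets(p.substring(dash + 1));
            if (start.length != 4 || end.length != 4) {
                return false;
            }
            return samePrefix(start, addr, 3)
                    && addr[3] >= start[3] && addr[3] <= end[3];
        }
        int[] pat = octets(p);
        if (pat.length == 4) {
            return samePrefix(pat, addr, 4); // exact single address
        }
        if (pat.length == 3) {
            return samePrefix(pat, addr, 3); // first 24 bits
        }
        return false;
    }

    /** Splits on dots and parses each part as a decimal integer. */
    private static int[] octets(String s) {
        String[] parts = s.trim().split("\\.");
        int[] out = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = Integer.parseInt(parts[i]);
        }
        return out;
    }

    /** True if the first n octets of a and b are equal. */
    private static boolean samePrefix(int[] a, int[] b, int n) {
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("11.22.33.44-11.22.33.55", "11.22.33.50"));
        System.out.println(matches("11.22.33", "11.22.34.1"));
    }
}
```

Under this reading, a range whose endpoints sit in different /24 blocks would only ever match addresses in the block of the starting address.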

-- 
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu

