[ 
http://jira.dspace.org/jira/browse/DS-440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11137#action_11137
 ] 

Mark Diggory commented on DS-440:
---------------------------------

I made changes last night so that spiders are loaded from individual files 
in a dedicated directory.  The changes have the following features:

1.) Spiders are stored under a URL-escaped version of the URL the file was 
retrieved from. Retrieval was changed to use the Ant Get task from Java to 
pull these files only if they have been updated. All spider files will be 
stored under ${dspace.dir}/config/spiders

2.) SpiderDetector loads all the files into a HashSet rather than a Vector to 
eliminate duplicates. A wrapper class with IP range detection was written so 
that IPs can be matched to partial subnet entries without expanding them.

3.) Adjusted the StatisticsUsageEvent class to pass the whole request into the 
SpiderDetector so that User Agents and IPs can be tested.

4.) Removed the prepending of IPs to the Solr query.

5.) Added a pre-filtering step that evaluates the IP before calling 
SolrLogger.post.
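
As a rough illustration of items 2 and 5, here is a minimal sketch of how a 
HashSet-backed wrapper could match an IP against partial subnet entries 
without expanding them. This is not the code from the patch: the class name 
IpPrefixSet and its methods are hypothetical, it assumes partial entries are 
dotted IPv4 prefixes such as "192.0.2", and it does not handle IPv6 or CIDR 
notation.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch, not the DSpace implementation.
 * Stores full IPv4 addresses and partial dotted prefixes in one HashSet,
 * so duplicate entries loaded from several spider files collapse to one.
 */
final class IpPrefixSet {
    private final Set<String> entries = new HashSet<String>();

    /** Adds a full address ("198.51.100.7") or a partial one ("192.0.2"). */
    void add(String entry) {
        String e = entry.trim();
        // Tolerate a trailing dot on partial entries, e.g. "192.0.2."
        if (e.endsWith(".")) {
            e = e.substring(0, e.length() - 1);
        }
        if (!e.isEmpty()) {
            entries.add(e);
        }
    }

    /**
     * Returns true if the address, or any prefix of it that ends on an
     * octet boundary, is in the set. At most four lookups for IPv4.
     */
    boolean contains(String ip) {
        String candidate = ip.trim();
        while (true) {
            if (entries.contains(candidate)) {
                return true;
            }
            int dot = candidate.lastIndexOf('.');
            if (dot < 0) {
                return false;
            }
            // Drop the last octet and try the shorter prefix.
            candidate = candidate.substring(0, dot);
        }
    }

    /** Number of distinct entries held. */
    int size() {
        return entries.size();
    }
}
```

A pre-filter in the spirit of item 5 would then call contains() on the 
request's remote address and skip the logging call when it returns true. 
Because prefixes are only shortened at the dots, an entry of "192.0.2" 
matches 192.0.2.55 but not 192.0.20.1.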

Still to do:

a.) Add a class that will prune the statistics Solr repo (via cron job)

b.) Add code to auto-detect robots.txt reads and add them to 
SpiderDetector, serializing them to a separate file

> spiders.txt empty
> -----------------
>
>                 Key: DS-440
>                 URL: http://jira.dspace.org/jira/browse/DS-440
>             Project: DSpace 1.x
>          Issue Type: Bug
>    Affects Versions: 1.6.0
>            Reporter: Stuart Lewis
>            Assignee: Mark Diggory
>             Fix For: 1.6.0
>
>         Attachments: [DS-440]_spiders_txt_is_empty.patch.txt
>
>
> spiders.txt is currently empty, so search engine robots are not being 
> excluded from solr stats.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://jira.dspace.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

_______________________________________________
Dspace-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-devel
