[ 
https://issues.apache.org/jira/browse/DROIDS-144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugen Paraschiv updated DROIDS-144:
-----------------------------------

    Attachment: DROIDS-144.patch

> The AlreadyVisitedFilter should not ignore the parameters of the URI
> --------------------------------------------------------------------
>
>                 Key: DROIDS-144
>                 URL: https://issues.apache.org/jira/browse/DROIDS-144
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 0.0.2
>            Reporter: Eugen Paraschiv
>             Fix For: 0.0.2
>
>         Attachments: DROIDS-144.patch
>
>
> This filter strips the parameters from the URI and stores only the resulting 
> URI as the key in its visited map. This severely limits the filter, because 
> distinct URIs that differ only in their parameters are treated as already 
> visited and skipped, when in fact they have not been crawled. 
> An example - these are pages to be crawled: 
> http://www.domain.com/abc/?page=0&start=
> http://www.domain.com/abc/?page=1&start=
> Once the first one is analyzed, only the host and path are considered: 
> http://www.domain.com/abc/
> and so the second URI will be rejected as already visited, when in fact it's 
> a completely new page. 
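
The difference between the two keying strategies can be sketched as follows. This is only an illustration of the behaviour described above, not the contents of DROIDS-144.patch and not the actual AlreadyVisitedFilter code; the class VisitedKeySketch and its methods are hypothetical names, and the only library used is java.net.URI.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not the Droids AlreadyVisitedFilter implementation.
class VisitedKeySketch {
    private final Set<String> visited = new HashSet<>();

    // Behaviour described in the report: the query string is dropped,
    // so only scheme, host and path end up in the key.
    static String keyWithoutQuery(String uri) {
        URI u = URI.create(uri);
        return u.getScheme() + "://" + u.getAuthority() + u.getRawPath();
    }

    // Proposed behaviour: keep the query string as part of the key.
    static String keyWithQuery(String uri) {
        String base = keyWithoutQuery(uri);
        String query = URI.create(uri).getRawQuery();
        return query == null ? base : base + "?" + query;
    }

    // Returns true the first time a URI is seen, false on later visits.
    boolean firstVisit(String uri) {
        return visited.add(keyWithQuery(uri));
    }
}
```

With keyWithoutQuery both example URIs map to the same key, so the second is rejected; with keyWithQuery they map to different keys and both are crawled.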

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
