[
https://issues.apache.org/jira/browse/DROIDS-144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eugen Paraschiv updated DROIDS-144:
-----------------------------------
Attachment: DROIDS-144.patch
> The AlreadyVisitedFilter should not ignore the parameters of the URI
> --------------------------------------------------------------------
>
> Key: DROIDS-144
> URL: https://issues.apache.org/jira/browse/DROIDS-144
> Project: Droids
> Issue Type: Improvement
> Components: core
> Affects Versions: 0.0.2
> Reporter: Eugen Paraschiv
> Fix For: 0.0.2
>
> Attachments: DROIDS-144.patch
>
>
> This filter strips the parameters from the URI and stores only the resulting
> URI as the key in its visited map. This severely limits the filter, because
> distinct URIs that differ only in their parameters are treated as already
> visited, when in fact they are not.
> An example - these are pages to be crawled:
> http://www.domain.com/abc/?page=0&start=
> http://www.domain.com/abc/?page=1&start=
> Once the first one is analyzed, only the host and path are considered:
> http://www.domain.com/abc/
> and so the second URI will be rejected as already visited, when in fact it's
> a completely new page.
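
The difference between the two behaviours can be sketched as follows. This is only an illustration, not the Droids AlreadyVisitedFilter source or the attached patch: the class name VisitedKeySketch and its methods are hypothetical, and it assumes the visited map is keyed by a string derived from a java.net.URI.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not the actual Droids AlreadyVisitedFilter.
final class VisitedKeySketch {

    // Keys of the URIs seen so far (assumed to be a plain in-memory set).
    private final Set<String> visited = new HashSet<>();

    // Key built from scheme, authority and path only: the behaviour
    // described in the report, where the parameters are stripped.
    static String keyWithoutQuery(URI uri) {
        return uri.getScheme() + "://" + uri.getAuthority() + uri.getRawPath();
    }

    // Key that also keeps the query string: the behaviour requested here.
    static String keyWithQuery(URI uri) {
        String base = keyWithoutQuery(uri);
        String query = uri.getRawQuery();
        return query == null ? base : base + "?" + query;
    }

    // Returns true the first time a URI is seen and false on later calls.
    boolean markVisited(URI uri) {
        return visited.add(keyWithQuery(uri));
    }
}
```

With the two example URIs above, keyWithoutQuery yields the same key for both, so the second would be rejected, while keyWithQuery keeps them distinct. Whether parameters should also be normalised (for example, reordered) before comparison is a separate design question that this sketch does not address.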
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira