The AlreadyVisitedFilter should not ignore the parameters of the URI
--------------------------------------------------------------------
Key: DROIDS-144
URL: https://issues.apache.org/jira/browse/DROIDS-144
Project: Droids
Issue Type: Improvement
Components: core
Affects Versions: 0.0.2
Reporter: Eugen Paraschiv
Fix For: 0.0.2
This filter strips the parameters from the URI and stores only the resulting
URI as the key in its visited map. This severely limits the filter, because
distinct URIs that differ only in their parameters are rejected as already
visited when in fact they are not.
An example - these are pages to be crawled:
http://www.domain.com/abc/?page=0&start=
http://www.domain.com/abc/?page=1&start=
Once the first one is analyzed, only the host and path are considered:
http://www.domain.com/abc/
and so the second URI will be rejected as already visited, when in fact it's a
completely new page.
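
One possible direction is sketched below, assuming absolute http(s) URIs. The
class name QueryAwareVisitedFilter, the visitedKey helper, and the convention
of returning null for a rejected link are hypothetical and for illustration
only; this is not the existing Droids code. The key keeps the query string and
drops only the fragment:

```java
import java.net.URI;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the Droids implementation.
public class QueryAwareVisitedFilter {

    // Keys of the links seen so far (thread-safe set).
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    // Builds the visited key from scheme, authority, path and query.
    // Only the fragment is dropped. Assumes an absolute URI.
    static String visitedKey(String link) {
        URI uri = URI.create(link).normalize();
        StringBuilder key = new StringBuilder()
                .append(uri.getScheme()).append("://")
                .append(uri.getRawAuthority())
                .append(uri.getRawPath());
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    // Returns the link if its key has not been seen before, or null if it
    // has (assumed convention: null means "rejected").
    public String filter(String link) {
        return visited.add(visitedKey(link)) ? link : null;
    }
}
```

With this key, the two example URIs above map to different entries, so the
second page is accepted, while a repeated request for ?page=0&start= is still
rejected. Parameter order is not normalized in this sketch, so ?a=1&b=2 and
?b=2&a=1 would count as different pages.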