mstrewe opened a new issue, #1448:
URL: https://github.com/apache/incubator-stormcrawler/issues/1448
The BasicUrlNormalizer will encode links if they are not already URL
encoded.
The bug occurs when the URL contains percent-encoded characters with
lowercase hex digits, e.g. `'/Exhibitions/Detail/NjAxOA%3d%3d'`. (The URL
`'/Exhibitions/Detail/NjAxOA%3D%3D'` is not affected.)
In BasicUrlNormalizer.java, lines 145-150, the file part of the URL is
unescaped and then escaped again. The original file and the re-escaped
file are then compared. Here the comparison is
`Exhibitions/Detail/NjAxOA%3d%3d` vs. `Exhibitions/Detail/NjAxOA%3D%3D`
(capital D), so the two are treated as different.
After that, the original source URL is recreated (line 154), which results
in `'Exhibitions/Detail/NjAxOA%253D%253D'`.
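
The mismatch can be reproduced in isolation with a minimal sketch. The `unescape` and `escape` helpers below are hypothetical stand-ins, not the actual BasicUrlNormalizer code. They assume ASCII-only input and an escape step that emits uppercase hex digits, which is what the comparison above suggests.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeRoundTripDemo {
    private static final Pattern ESCAPED = Pattern.compile("%([0-9A-Fa-f]{2})");

    // Hypothetical stand-in: decode every %XX sequence to its ASCII character.
    static String unescape(String path) {
        Matcher m = ESCAPED.matcher(path);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(path, last, m.start());
            sb.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        sb.append(path, last, path.length());
        return sb.toString();
    }

    // Hypothetical stand-in: re-encode characters outside a small safe set,
    // always writing the hex digits in uppercase.
    static String escape(String path) {
        StringBuilder sb = new StringBuilder();
        for (char c : path.toCharArray()) {
            if (Character.isLetterOrDigit(c) || "/-._~".indexOf(c) >= 0) {
                sb.append(c);
            } else {
                sb.append(String.format("%%%02X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String file = "/Exhibitions/Detail/NjAxOA%3d%3d";
        String file2 = escape(unescape(file));
        System.out.println(file2);              // round-tripped path
        System.out.println(file.equals(file2)); // case-sensitive comparison
    }
}
```

Under these assumptions the round trip only changes the case of the hex digits, yet a case-sensitive `equals` reports the two strings as different.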
This can be fixed by changing the statement in line 148 from
```
if (!file.equals(file2)) {
```
to
```
if (!file.toLowerCase().equals(file2.toLowerCase())) {
```
The case of the hex digits in a percent-encoded sequence should not matter,
and with this change it no longer does.
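
As a possible variation on the proposed change, the comparison could ignore case only inside the `%XX` sequences rather than lowercasing the whole string. The sketch below is an assumption, not code from the project. `upperCaseEscapes` and `sameIgnoringEscapeCase` are hypothetical helper names.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeCaseCompare {
    private static final Pattern ESCAPED = Pattern.compile("%[0-9A-Fa-f]{2}");

    // Hypothetical helper: uppercase the hex digits of every %XX sequence
    // and leave all other characters untouched.
    static String upperCaseEscapes(String s) {
        Matcher m = ESCAPED.matcher(s);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(s, last, m.start());
            sb.append(m.group().toUpperCase(Locale.ROOT));
            last = m.end();
        }
        sb.append(s, last, s.length());
        return sb.toString();
    }

    // Hypothetical helper: true if the two paths differ at most in the
    // case of their percent-encoded hex digits.
    static boolean sameIgnoringEscapeCase(String a, String b) {
        return upperCaseEscapes(a).equals(upperCaseEscapes(b));
    }

    public static void main(String[] args) {
        System.out.println(sameIgnoringEscapeCase(
                "/Exhibitions/Detail/NjAxOA%3d%3d",
                "/Exhibitions/Detail/NjAxOA%3D%3D"));
    }
}
```

Passing `Locale.ROOT` avoids locale-dependent case mapping. A bare `toLowerCase()` uses the default locale, so if the one-line fix is kept, `equalsIgnoreCase` or `toLowerCase(Locale.ROOT)` may be the safer choice.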