[
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Doğacan Güney updated NUTCH-505:
--------------------------------
Attachment: NUTCH-505-v2.patch
After my last commit, I read that Sun's java.util.regex implementation is
actually faster than jakarta-oro, so I changed UrlValidator to use
java.util.regex instead. I ran some simple tests, and java.util.regex does
seem to be faster. I also added some basic optimizations to
ParseOutputFormat (passing initialCapacity arguments to the ArrayList
constructors to reduce the number of allocations).
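
For readers who have not opened the patch, here is a rough sketch of the two
ideas above: a java.util.regex Pattern compiled once and reused, and an
ArrayList sized up front. This is not the code in NUTCH-505-v2.patch. The
class name, the method names, and the pattern itself are made up for
illustration, and the pattern is much simpler than a real validator would need.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not the UrlValidator or ParseOutputFormat from the patch.
final class OutlinkCheckSketch {

  // Compiled once and shared; Pattern instances are immutable and thread-safe.
  // Assumed, simplified grammar: http(s) scheme, host, optional port,
  // optional path, optional query, optional fragment.
  private static final Pattern URL_PATTERN = Pattern.compile(
      "(https?)://([A-Za-z0-9.-]+)(:\\d+)?"
      + "(/[A-Za-z0-9._~!$&'()*+,;=:@%/-]*)?"
      + "(\\?[^#\\s]*)?(#\\S*)?");

  static boolean isValid(String url) {
    // matches() requires the whole string to fit the pattern.
    return url != null && URL_PATTERN.matcher(url).matches();
  }

  static List<String> filterValid(String[] outlinks) {
    // Sizing the list up front avoids repeated growth of the backing array.
    List<String> valid = new ArrayList<String>(outlinks.length);
    for (String link : outlinks) {
      if (isValid(link)) {
        valid.add(link);
      }
    }
    return valid;
  }
}
```

With this pattern, a plain URL such as http://www.variety.com/ passes, while a
URL with markup appended to it is rejected.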
Is it necessary to reopen this issue or open another issue for this? I think
this one is simple enough to commit without opening a separate issue, but feel
free to disagree.
Also, I realized that UrlValidator considers
http://www.iiit.net/images/CCCCCC_line_br[1].gif invalid, even though Firefox
will display the GIF (Firefox escapes the path, then fetches the escaped URL).
This doesn't seem to be a problem right now, since Nutch can't fetch these URLs
anyway, but we may consider adding some sort of smart escaping later.
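
To make the escaping idea concrete, here is one possible sketch that uses only
the JDK. The class and method names are hypothetical, and nothing like this is
in the patch. It relies on the multi-argument java.net.URI constructor, which
quotes characters that are illegal in a path.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper, not part of Nutch.
final class PathEscapeSketch {

  static String escapePath(String raw) {
    try {
      // java.net.URL parses leniently and keeps '[' and ']' in the path as-is.
      URL u = new URL(raw);
      // The five-argument URI constructor percent-encodes illegal characters
      // in the path, query, and fragment.
      URI escaped = new URI(u.getProtocol(), u.getAuthority(), u.getPath(),
          u.getQuery(), u.getRef());
      return escaped.toASCIIString();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("not a parseable url: " + raw, e);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException("cannot escape url: " + raw, e);
    }
  }
}
```

For the URL above, this yields
http://www.iiit.net/images/CCCCCC_line_br%5B1%5D.gif. One caveat: this
constructor always quotes '%', so a URL that is already escaped would be
escaped a second time. That is the part a "smart" version would have to
handle.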
> Outlink urls should be validated
> --------------------------------
>
> Key: NUTCH-505
> URL: https://issues.apache.org/jira/browse/NUTCH-505
> Project: Nutch
> Issue Type: Improvement
> Reporter: Doğacan Güney
> Assignee: Doğacan Güney
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-505-v2.patch, NUTCH-505.patch, NUTCH-505.patch,
> NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch
>
>
> See discussion here:
> http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html
> Parse plugins may extract garbage urls from pages. We need a url validation
> system that tests these urls and filters out garbage.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.