Add additional filtering until the file is saved on disk
--------------------------------------------------------
Key: DROIDS-142
URL: https://issues.apache.org/jira/browse/DROIDS-142
Project: Droids
Issue Type: New Feature
Components: core
Affects Versions: 0.0.2
Reporter: Eugen Paraschiv
Fix For: 0.0.2
The existing filtering process allows URLs to be accepted based on the URL
itself, which is very useful. There are some cases though where you need to
decide if the file is relevant and should be saved (or not) based on the
content itself.
There should be a step in SaveHandler, before the file is actually saved, where
the handler can decide whether the file is to be persisted or ignored, based on
the URL but also on the file contents themselves. It is here that specific
checks should be introduced to further filter out files.
- note: as an example of this, consider the very common site that doesn't
really have hierarchical, well-defined URLs, but instead simple
/domain/object1, /domain/object2 type URLs; these links don't really say
anything about the content, so filtering them by a regex would do no good;
the page itself, however, is likely to contain all the information required to
have more granular filtering in place
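
To make the proposal concrete, here is a minimal sketch of what such a
pre-save check could look like. It is an illustration only, not the existing
Droids API: ContentFilter, RegexContentFilter and FilteringSaveStep are
hypothetical names, the "save" is simulated with an in-memory list, and the
content is assumed to be UTF-8 text.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical interface (not part of Droids): decides whether fetched
// content should be persisted, given both the URL and the content.
interface ContentFilter {
    boolean accept(URI uri, byte[] content);
}

// Hypothetical filter: accepts a page only if its text contains a match
// for the given regular expression.
final class RegexContentFilter implements ContentFilter {
    private final Pattern pattern;

    RegexContentFilter(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    @Override
    public boolean accept(URI uri, byte[] content) {
        // Assumption: content is UTF-8 text; a real handler would use the
        // charset declared by the response.
        String text = new String(content, StandardCharsets.UTF_8);
        return pattern.matcher(text).find();
    }
}

// Stand-in for the save step of a handler. Every filter must accept the
// content before it is "saved"; otherwise the file is ignored.
final class FilteringSaveStep {
    private final List<ContentFilter> filters;
    private final List<URI> saved = new ArrayList<>();

    FilteringSaveStep(List<ContentFilter> filters) {
        this.filters = filters;
    }

    // Returns true if the content passed all filters and was saved.
    boolean handle(URI uri, byte[] content) {
        for (ContentFilter filter : filters) {
            if (!filter.accept(uri, content)) {
                return false; // rejected: nothing is written
            }
        }
        saved.add(uri); // a real handler would write the file to disk here
        return true;
    }

    List<URI> saved() {
        return saved;
    }
}
```

With a filter such as new RegexContentFilter("(?i)crawler"), two pages at
/domain/object1 and /domain/object2 would be kept or dropped according to
their text rather than their URLs, which is the case described in the note
above.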
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira