Hi all,
I need to extract and analyse crawled URLs, because they may contain
parameters such as "?redirectURL=" that point to new documents to be
fetched and indexed.
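To make the extraction step concrete, here is a minimal sketch of what I mean, using only the JDK (Java 10+ for the URLDecoder overload). The class and method names are hypothetical and nothing here is ManifoldCF API; it only pulls the decoded "redirectURL" value out of a crawled URL, if present.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper (not part of ManifoldCF): finds the "redirectURL"
// query parameter in a crawled URL and returns its decoded value.
final class RedirectParamExtractor {
    private static final String PARAM = "redirectURL";

    static Optional<String> extractRedirect(String url) {
        try {
            // getRawQuery keeps percent-encoding intact so that an encoded
            // '&' or '=' inside the value does not break the splitting below.
            String query = URI.create(url).getRawQuery();
            if (query == null) {
                return Optional.empty();
            }
            for (String pair : query.split("&")) {
                int eq = pair.indexOf('=');
                if (eq < 0 || !pair.substring(0, eq).equals(PARAM)) {
                    continue;
                }
                String value = URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
            return Optional.empty();
        } catch (IllegalArgumentException e) {
            // Malformed URL or malformed percent-encoding: treat as "no redirect".
            return Optional.empty();
        }
    }
}
```

The part I am missing is how to hand the returned URL back to ManifoldCF so that it gets fetched.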
First I tried writing a transformation connector,
public class RedirectExtractor extends
org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
and adding it as a "RedirectExtractor" transformation step to the fetch
process in ManifoldCF, but it only allows me to modify the current document,
not to trigger a new fetch for the URL extracted from the parameter.
Looking through the ManifoldCF source code, I found something that may
be useful:
activities.recordActivity(null,ACTIVITY_FETCH,
null,urlValue,Integer.toString(-2),"Robots exclusion",null);
This comes from the IProcessActivity interface, which is used by the
connectors. I would rather not create a new connector, since that is fairly
complex. Do you see an alternative, or is this the only way?
Thanks in advance.