Thanks, I suspected as much while reviewing the code, but I was hoping there was an alternative :)
Regards.

On Thu, 26 Jul 2018 at 12:11, Karl Wright (<[email protected]>) wrote:

> ManifoldCF has the concept of "compound document", but all the independent
> "components" of the document must be identified at the root level (that is,
> in the Repository Connector).
>
> I'm therefore afraid there is no good mapping from ManifoldCF concepts to
> what you want to do without writing your own Repository Connector.
>
> Karl
>
>
> On Thu, Jul 26, 2018 at 5:06 AM Gustavo Beneitez <[email protected]>
> wrote:
>
> > Hi Karl,
> >
> > I made a quick picture of what I really need (attached).
> >
> > Certain URLs coming from the repository could be split into two: URL1 and
> > URL2.
> >
> > The normal flow acts as if only one URL is present, but by writing a new
> > transformation I could also detect that there is another one: URL2.
> > My question now is: "well, I have URL2; how can I then inject it into the
> > flow so that it becomes a new URL from the repository (and is then
> > fetched, processed and ingested like the others)?"
> >
> > Thanks.
> >
> >
> > On Thu, 26 Jul 2018 at 0:35, Karl Wright (<[email protected]>) wrote:
> >
> >> The crawled URL is transmitted as part of the RepositoryDocument object
> >> to the output connector. If this is going to Solr, it's used as the
> >> document's ID. You can therefore customize Solr (or ElasticSearch) to
> >> extract the data you need at the indexing end.
> >>
> >> If this doesn't make any sense to you, then please be more specific
> >> about what the disposition of each crawled document is.
> >>
> >> Thanks,
> >> Karl
> >>
> >>
> >> On Wed, Jul 25, 2018 at 5:57 PM Gustavo Beneitez <[email protected]>
> >> wrote:
> >>
> >> > Hi all,
> >> >
> >> > I need to extract and analyse crawled URLs because they may contain
> >> > certain parameters such as "?redirectURL=" that could point to new
> >> > Documents to be fetched and indexed.
> >> >
> >> > First I tried to create a subclass that extends
> >> >
> >> > public class RedirectExtractor extends
> >> >   org.apache.manifoldcf.agents.transformation.BaseTransformationConnector
> >> >
> >> > and add a "RedirectExtractor" transformation step to the fetch process
> >> > in ManifoldCF, but it only allows me to modify the current Document,
> >> > not to create a new FETCH from the extracted parameter.
> >> >
> >> > I was investigating the ManifoldCF source code and I found something
> >> > that may be useful
> >> >
> >> > activities.recordActivity(null,ACTIVITY_FETCH,
> >> >   null,urlValue,Integer.toString(-2),"Robots exclusion",null);
> >> >
> >> > from the IProcessActivity interface, which is used by the Connectors.
> >> > I didn't want to create a new connector since it is a bit complex but,
> >> > do you see an alternative, or is this the only way?
> >> >
> >> > Thanks in advance.
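
For reference, the parameter extraction described in the original question could look roughly like the sketch below. It covers only the parsing step and uses nothing but the JDK (Java 10 or later). The class and method names are hypothetical and not part of ManifoldCF; only the "redirectURL" parameter name comes from this thread. As Karl notes, handing the extracted URL back to the crawl would still have to happen inside a Repository Connector, which this sketch does not attempt.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper, not a ManifoldCF class.
class RedirectUrlExtractor {
    // Parameter name taken from the thread.
    private static final String PARAM = "redirectURL";

    /** Returns the decoded value of the redirectURL query parameter, if any. */
    static Optional<String> extractRedirectUrl(String crawledUrl) {
        try {
            // getRawQuery keeps percent-escapes, so '&' and '=' inside the
            // embedded URL are not mistaken for separators.
            String query = URI.create(crawledUrl).getRawQuery();
            if (query == null) {
                return Optional.empty();
            }
            for (String pair : query.split("&")) {
                int eq = pair.indexOf('=');
                if (eq < 0 || !PARAM.equals(pair.substring(0, eq))) {
                    continue;
                }
                String value = URLDecoder.decode(
                        pair.substring(eq + 1), StandardCharsets.UTF_8);
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
            return Optional.empty();
        } catch (IllegalArgumentException e) {
            // Malformed URL or malformed percent-escape: treat as "no redirect".
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // example.com / other.example are placeholder hosts.
        System.out.println(extractRedirectUrl(
                "http://example.com/page?redirectURL=http%3A%2F%2Fother.example%2Fdoc"));
    }
}
```

A URL without the parameter, or one that cannot be parsed, yields an empty Optional, so the caller can simply skip it.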
