Yes, I would separate the work of transforming documents from the work of fetching them.
Karl

On Wed, Feb 20, 2019 at 9:46 PM Kayak28 <kaya.ota....@gmail.com> wrote:

> Hello, Mr. Karl Wright:
>
> Thank you for the quick response.
> As you mentioned, yes, I am writing my own Repository Connector to access
> the REST API I want to use.
>
> If I need to do more scraping than the provided html-extractor can do, then
> I should write a transformer connector that works the way I want.
> Is that right? And it is not a good idea to do the scraping in my
> Repository Connector, is it?
>
> Again, I appreciate your replies to these basic questions.
>
> Sincerely,
> Kaya
>
> On Thu, Feb 21, 2019 at 11:26, Karl Wright <daddy...@gmail.com> wrote:
>
>> Hi Kaya,
>>
>> You should be able to use the existing Solr connector to index documents
>> into Solr.
>> You will probably need to write a Repository connector to access the REST
>> API you describe.
>> If the kind of scraping you need to do can be covered by the
>> html-extractor transformer in its current form, then you can insert it into
>> the pipeline between the other two connections and you should be all set.
>>
>> Karl
>>
>> On Wed, Feb 20, 2019 at 9:17 PM Kayak28 <kaya.ota....@gmail.com> wrote:
>>
>>> Hello, folks:
>>>
>>> I have a question about crawling and scraping in ManifoldCF.
>>> I want to perform the following sequence of tasks using MCF:
>>>
>>> 1. crawl data from a RESTful API
>>> 2. scrape the data
>>> 3. insert the data into Apache Solr
>>>
>>> In this case, is this how I need to set up ManifoldCF?
>>> 1. define an output connector to access the RESTful API (by using the Web
>>> crawler connector or the Generic connector?)
>>> 2. define a transformer connector to scrape the HTML (by using the
>>> html-extractor transformer connector?)
>>> 3. define the output connector to be Solr
>>>
>>> Or do I have to use other software, such as Apache NiFi, to control the
>>> sequence of these tasks?
>>>
>>> I appreciate any comments and replies.
>>>
>>> Sincerely,
>>> Kaya
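
For illustration, the separation discussed in this thread (fetch, then transform, then index) might be sketched as follows. This is a minimal Python sketch and not ManifoldCF code: the names fetch_documents, extract_text, and run_pipeline are hypothetical, the raw_pages dictionary stands in for responses from the REST API, and the index callback stands in for the Solr output connection. It only shows why keeping the scraping step out of the fetching step makes each stage replaceable on its own.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, Iterator, List

Document = Dict[str, str]


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def fetch_documents(raw_pages: Dict[str, str]) -> Iterator[Document]:
    """Fetch stage (repository role, hypothetical): yields documents unmodified."""
    for doc_id, html in raw_pages.items():
        yield {"id": doc_id, "body": html}


def extract_text(doc: Document) -> Document:
    """Transform stage (hypothetical): strips markup; knows nothing about fetching."""
    parser = _TextExtractor()
    parser.feed(doc["body"])
    parser.close()
    return {"id": doc["id"], "body": " ".join(parser.parts)}


def run_pipeline(
    raw_pages: Dict[str, str],
    transformers: Iterable[Callable[[Document], Document]],
    index: Callable[[Document], None],
) -> None:
    """Passes each fetched document through every transformer, then indexes it."""
    transformers = list(transformers)
    for doc in fetch_documents(raw_pages):
        for transform in transformers:
            doc = transform(doc)
        index(doc)


if __name__ == "__main__":
    pages = {"doc-1": "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"}
    run_pipeline(pages, [extract_text], print)
```

Because the transform stage is a separate function, a custom scraping step could be swapped in or chained after extract_text without touching the code that talks to the REST API, which is the same reasoning behind putting custom scraping in its own transformer rather than in the Repository connector.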