Thanks so much for your response, Markus. I had no idea about this concept. I will read more about it and try to do it. If you know of any pages or tutorials on that, I would appreciate it if you could please send me some links or...
Bests
Shakiba Davari

On Thu, Jul 21, 2016 at 7:10 PM, Markus Jelsma <[email protected]> wrote:

> Hello Shakiba - the best solution, and one that solves a lot of additional
> problems, is to write an indexing backend plugin specifically for your
> indexing service. The coding involved is quite straightforward, apart from
> any nuances your indexing backend might have.
>
> Dumping files and reprocessing them for indexing purposes is not a good
> idea, because you lose information such as CrawlDB status, including new
> 404s.
>
> M.
>
>
>
> -----Original message-----
> > From: shakiba davari <[email protected]>
> > Sent: Friday 22nd July 2016 0:57
> > To: [email protected]
> > Subject: mapping files created by: nutch dump to the URL from which each
> > file has been dumped.
> >
> > Hi, I am trying to crawl some URLs with Apache Nutch and then index them
> > with the Bluemix Retrieve and Rank service. To do so, I crawl my data with
> > Nutch and dump the crawled data as files (mostly HTML files) into a
> > directory:
> >
> > bin/nutch dump -flatdir -outputDir dumpData/ -segment crawl/segments/
> >
> > Then I send these files to the Bluemix Document Converter service to
> > create a JSON file compatible with the Bluemix R&R service, and post this
> > JSON file to my R&R service. My question is: how can I map each one of
> > these files to its related URL?
> >
> > The commoncrawldump command creates JSON files that have a URL field in
> > them, but I couldn't use these files for my purpose for various reasons,
> > the first being that they contain some binary characters.
> >
> > Bests
> > Shakiba Davari
> >
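For reference, a rough skeleton of the kind of indexing backend plugin Markus describes could look like the sketch below. It assumes the Nutch 1.x IndexWriter plugin interface; the exact method signatures (in particular open()) vary between Nutch releases, so check the interface shipped with your version. The package name, the Bluemix class name and the postToRetrieveAndRank helper are hypothetical placeholders. The point is that write() receives each NutchDocument together with its URL, so no separate file-to-URL mapping step is needed.

package org.apache.nutch.indexwriter.bluemix;   // hypothetical package name

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.indexer.IndexWriter;
import org.apache.nutch.indexer.NutchDocument;

public class BluemixIndexWriter implements IndexWriter {

  private Configuration conf;

  @Override
  public void open(JobConf job, String name) throws IOException {
    // Set up the HTTP client / connection to the R&R collection here.
  }

  @Override
  public void write(NutchDocument doc) throws IOException {
    // Field names depend on the enabled indexing filters;
    // index-basic provides "url" and "content".
    Object url = doc.getFieldValue("url");
    Object content = doc.getFieldValue("content");
    if (url != null && content != null) {
      // postToRetrieveAndRank(url.toString(), content.toString());
      // hypothetical helper: build the R&R-compatible JSON and POST it.
    }
  }

  @Override
  public void delete(String key) throws IOException {
    // Remove the document (e.g. a page that became a 404) from the collection.
  }

  @Override
  public void update(NutchDocument doc) throws IOException {
    write(doc);
  }

  @Override
  public void commit() throws IOException {
  }

  @Override
  public void close() throws IOException {
  }

  @Override
  public String describe() {
    return "BluemixIndexWriter - sends NutchDocuments to a Retrieve and Rank collection";
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

Like any Nutch plugin, it would still need a plugin.xml descriptor and an entry in plugin.includes in nutch-site.xml before Nutch picks it up during the index step.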

