Thanks so much for your response, Markus. I had no idea about this concept. I will read more about it and try to do it. If you know of any pages or tutorials on that, I would appreciate it if you could please send me some links or...
Bests
Shakiba Davari

On Thu, Jul 21, 2016 at 7:10 PM, Markus Jelsma <[email protected]> wrote:

> Hello Shakiba - the best solution, and one that solves a lot of additional
> problems, is to write an indexing backend plugin specifically for your
> indexing service. The coding involved is quite straightforward, apart from
> any nuances your indexing backend might have.
>
> Dumping files and reprocessing them for indexing purposes is not a good
> idea, because you lose information such as CrawlDB status, including new
> 404s.
>
> M.
>
>
>
> -----Original message-----
> > From: shakiba davari <[email protected]>
> > Sent: Friday 22nd July 2016 0:57
> > To: [email protected]
> > Subject: mapping files created by: nutch dump to the URL from which each
> > file has been dumped.
> >
> > Hi, I am trying to crawl some URLs with Apache Nutch and then index them
> > with the Bluemix Retrieve and Rank service. To do so, I crawl my data with
> > Nutch and dump the crawled data as files (mostly HTML files) into a
> > directory:
> >
> > bin/nutch dump -flatdir -outputDir dumpData/ -segment crawl/segments/
> >
> > Then I send these files to the Bluemix Document Converter service to
> > create a JSON file compatible with the Bluemix R&R service, and post this
> > JSON file to my R&R service. My question is: how can I map each one of
> > these files to its related URL?
> >
> > The commoncrawldump command creates JSON files that have a URL field in
> > them, but I couldn't use these files for my purpose for various reasons,
> > the first being that they contain some binary characters.
> >
> > Bests
> > Shakiba Davari
> >
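For reference, a rough skeleton of the kind of indexing backend plugin Markus describes could look like the sketch below. It assumes the Nutch 1.x IndexWriter plugin interface; the exact method signatures (in particular open()) vary between Nutch releases, so check the interface shipped with your version. The package name, the Bluemix class name and the postToRetrieveAndRank helper are hypothetical placeholders. The point is that write() receives each NutchDocument together with its URL, so no separate file-to-URL mapping step is needed.

package org.apache.nutch.indexwriter.bluemix;   // hypothetical package name

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.nutch.indexer.IndexWriter;
import org.apache.nutch.indexer.NutchDocument;

public class BluemixIndexWriter implements IndexWriter {

  private Configuration conf;

  @Override
  public void open(JobConf job, String name) throws IOException {
    // Set up the HTTP client / connection to the R&R collection here.
  }

  @Override
  public void write(NutchDocument doc) throws IOException {
    // Field names depend on the enabled indexing filters;
    // index-basic provides "url" and "content".
    Object url = doc.getFieldValue("url");
    Object content = doc.getFieldValue("content");
    if (url != null && content != null) {
      // postToRetrieveAndRank(url.toString(), content.toString());
      // hypothetical helper: build the R&R-compatible JSON and POST it.
    }
  }

  @Override
  public void delete(String key) throws IOException {
    // Remove the document (e.g. a page that became a 404) from the collection.
  }

  @Override
  public void update(NutchDocument doc) throws IOException {
    write(doc);
  }

  @Override
  public void commit() throws IOException {
  }

  @Override
  public void close() throws IOException {
  }

  @Override
  public String describe() {
    return "BluemixIndexWriter - sends NutchDocuments to a Retrieve and Rank collection";
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

Like any Nutch plugin, it would still need a plugin.xml descriptor and an entry in plugin.includes in nutch-site.xml before Nutch picks it up during the index step.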

