Hello All,

After performing a crawl using Nutch, I wanted to read the content of all the crawled URLs, so I ran the following command: "$NUTCH_HOME/bin/nutch readseg -dump $segment myseg", where $segment contains the name of the segment and 'myseg' is the directory in which the dump of the segment is created. I noticed that a complete dump of all the crawled URLs has been placed in a single file named 'dump' within the 'myseg' directory.
I would like to write the content of each URL to a separate file, i.e. if there are N URLs, the content of the N parsed URLs should be written to N files. I know I could use 'wget', but I want to do this with Nutch. Since I found no way to do it, I looked at the source code of the class org.apache.nutch.segment.SegmentReader. In the method dump(Path segment, Path output), at line no. 222, I replaced "job.setOutputFormat(TextOutputFormat.class);" with "job.setOutputFormat(MultipleTextOutputFormat.class);" Ref: http://hadoop.apache.org/common/docs/r0.19.1/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html

After performing the crawl again, I get the same output as earlier. Is there a direct way to write the content of each URL into a separate file? Otherwise, I shall write a program that parses the single 'dump' file and splits its content into separate files, which does not seem appropriate. Does Nutch have a direct way? (A sketch of the output format subclass I am considering is appended after my signature.)

Cross-posted on the nutch-user and nutch-dev mailing lists.

-- Ankit Dangi
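P.S. In case it helps: here is a minimal, untested sketch of what I plan to try next. If I read the MultipleOutputFormat Javadoc correctly, MultipleTextOutputFormat only splits output into separate files when generateFileNameForKeyValue() is overridden; the default implementation just returns the usual part-NNNNN leaf name, which would explain why swapping in the class alone still produced the single 'dump' file. The sketch assumes the dump job's output key is the URL (Text) and the output value is Text; the class name 'UrlPerFileOutputFormat' is made up.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

/** Writes each reduce output record to a file named after its key (the URL). */
public class UrlPerFileOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // Turn the URL into a safe file name by replacing characters that are
    // not legal (or not convenient) in file names.
    return key.toString().replaceAll("[^A-Za-z0-9._-]", "_");
  }
}

and then, in SegmentReader.dump(), point the job at this class instead:

job.setOutputFormat(UrlPerFileOutputFormat.class);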