Hey,

Try the command "bin/nutch readseg -dump" [1][2]. It reads a segment (or multiple segments) and outputs its content, including outlinks, HTML content, parsed content...
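For example, here is a rough sketch of how the dump could be limited to the raw fetched content (the HTML source). The segment path and output directory are placeholders, not real paths from your crawl, and the `-no*` filter flags are the ones listed in the Nutch 1.x readseg usage text as far as I know, so check the output of "bin/nutch readseg" on your version before relying on them.

```shell
#!/bin/sh
# Hypothetical locations: replace with a real segment from your crawl.
SEGMENT="${SEGMENT:-crawl/segments/20120326003900}"
OUTDIR="${OUTDIR:-segdump}"

# Build a readseg command that keeps only the raw content and skips the
# fetch/generate/parse parts of the segment (assumed Nutch 1.x flags).
readseg_cmd() {
  echo "bin/nutch readseg -dump $1 $2 -nofetch -nogenerate -noparse -noparsedata -noparsetext"
}

CMD=$(readseg_cmd "$SEGMENT" "$OUTDIR")
echo "$CMD"

# Only run it from inside a Nutch install; otherwise just print the command.
# Note: $CMD is word-split on spaces, so paths must not contain spaces.
if [ -x bin/nutch ]; then
  $CMD
fi
```

If it works the way I expect, the result ends up in a plain-text file inside the output directory, with one record per fetched URL.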
I hope it helps!

Remi

[1]: http://www.marco.bianchi.name/myPortal/using-the-binnutch-readseg-command.aspx
[2]: http://wiki.apache.org/nutch/bin/nutch_readseg

On Mon, Mar 26, 2012 at 12:39 AM, JohnRodey <[email protected]> wrote:

> I am just doing a simple project for my Information Retrieval class. I am
> currently using nutch to get a bunch of pages and it is indexing and storing
> the parsed page to SOLR. What I really want to do is have it store the page
> source with HTML tags as well. Is there an easy way to tell nutch to do that?
>
> If not, after I have my pages indexed if I want to retrieve there original
> source from nutch what would be the command to do that?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Out-of-the-box-Nutch-indexing-url-source-to-Solr-tp3855918p3855918.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

