Hey,

Try the command "bin/nutch readseg -dump" [1][2].
It reads a segment (or multiple segments) and outputs their content,
including outlinks, HTML content, parsed content, and so on.
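
In case it is useful, here is a rough sketch of how the call could be put together from a script. The segment path, the output directory name, and the helper names are made up for illustration. The "-no..." options are the ones I remember from the Nutch 1.x readseg usage message, so please check them against what "bin/nutch readseg" prints on your install.

```python
import subprocess
from typing import List

# Options that tell readseg to skip every segment part except the raw content.
# Assumption: these are the Nutch 1.x option names; verify with the usage
# message printed by "bin/nutch readseg" before relying on them.
SKIP_ALL_BUT_CONTENT = [
    "-nofetch", "-nogenerate", "-noparse", "-noparsedata", "-noparsetext",
]


def build_readseg_dump_cmd(segment_dir: str, output_dir: str,
                           nutch: str = "bin/nutch",
                           raw_content_only: bool = True) -> List[str]:
    """Build the argument list for 'readseg -dump' (hypothetical helper)."""
    cmd = [nutch, "readseg", "-dump", segment_dir, output_dir]
    if raw_content_only:
        # Keep only the fetched page content (the HTML source).
        cmd += SKIP_ALL_BUT_CONTENT
    return cmd


def dump_segment(segment_dir: str, output_dir: str) -> None:
    """Run the dump; raises CalledProcessError if Nutch exits non-zero."""
    subprocess.run(build_readseg_dump_cmd(segment_dir, output_dir), check=True)


# Show the command for a hypothetical segment without running Nutch.
print(" ".join(build_readseg_dump_cmd("crawl/segments/20120326000000", "segdump")))
```

Calling dump_segment(...) from the Nutch home directory would run the command for real. The dump should then appear as plain text inside the output directory, with the raw HTML in the content part of each record.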

I hope it helps!

Remi

[1]: http://www.marco.bianchi.name/myPortal/using-the-binnutch-readseg-command.aspx
[2]: http://wiki.apache.org/nutch/bin/nutch_readseg

On Mon, Mar 26, 2012 at 12:39 AM, JohnRodey <[email protected]> wrote:

> I am just doing a simple project for my Information Retrieval class.  I am
> currently using Nutch to get a bunch of pages, and it is indexing and
> storing the parsed pages to Solr.  What I really want to do is have it
> store the page source with HTML tags as well.  Is there an easy way to
> tell Nutch to do that?
>
> If not, after I have my pages indexed, if I want to retrieve their original
> source from Nutch, what would be the command to do that?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Out-of-the-box-Nutch-indexing-url-source-to-Solr-tp3855918p3855918.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
