Hi,
Using: nutch-2004-07-01.tar.gz, java version "1.4.2_02"
I have run "bin/nutch crawl ...." successfully.
I would like to access the raw HTML of the downloaded pages programmatically.
I have looked at net.nutch.db.WebDBReader, and was able to access an
enumeration of Page objects through the method call WebDBReader.pages().
However, the Page object does not contain an HTML content field.
I have also looked at net.nutch.protocol.Content, since it seemed to be
able to access the HTML content. However, I ran into an EOFException.
I ran the following command:
"bin/nutch net.nutch.protocol.Content 0 $SEG_DIR/20040702121629/"
and received the following EOFException:
Exception in thread "main" java.io.EOFException
    at java.io.RandomAccessFile.readFully(RandomAccessFile.java:365)
    at java.io.RandomAccessFile.readFully(RandomAccessFile.java:343)
    at net.nutch.io.SequenceFile$Reader.init(SequenceFile.java:150)
    at net.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:134)
    at net.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:127)
    at net.nutch.io.MapFile$Reader.<init>(MapFile.java:182)
    at net.nutch.io.MapFile$Reader.<init>(MapFile.java:160)
    at net.nutch.io.ArrayFile$Reader.<init>(ArrayFile.java:35)
    at net.nutch.protocol.Content.main(Content.java:157)
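One possible reading of the trace: the exception comes from
RandomAccessFile.readFully while SequenceFile$Reader is still initializing,
and readFully throws EOFException when a file ends before the requested bytes
are available. So one of the files under the segment directory may be empty or
truncated. Below is a minimal sketch for checking that, using only java.io.
The class SegmentFileCheck, the method countShortFiles, and the minBytes
threshold are hypothetical names for illustration and are not part of Nutch;
the sketch does not read the Nutch file format, it only reports file sizes.

```java
import java.io.File;

public class SegmentFileCheck {

    // Hypothetical helper, not part of Nutch: walks dir recursively, prints
    // every regular file shorter than minBytes, and returns how many it found.
    static int countShortFiles(File dir, long minBytes) {
        File[] children = dir.listFiles();
        if (children == null) {
            return 0; // dir is not a directory, or it cannot be read
        }
        int count = 0;
        for (int i = 0; i < children.length; i++) {
            File child = children[i];
            if (child.isDirectory()) {
                count += countShortFiles(child, minBytes);
            } else if (child.length() < minBytes) {
                System.out.println(child.getPath() + " (" + child.length() + " bytes)");
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("usage: java SegmentFileCheck <segment-dir> [min-bytes]");
            return;
        }
        // Assumption: a threshold of 1 byte flags only completely empty files.
        long minBytes = args.length > 1 ? Long.parseLong(args[1]) : 1L;
        int found = countShortFiles(new File(args[0]), minBytes);
        System.out.println(found + " file(s) shorter than " + minBytes + " byte(s)");
    }
}
```

Running it as "java SegmentFileCheck $SEG_DIR/20040702121629" would list any
empty files in that segment. If none are reported, the cause is presumably
elsewhere, for example in the path passed to the command.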
How do I correctly access the raw HTML content?
Thanks.
Chris
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers