[Nutch-dev] Access to html content of downloaded pages

Chris Tryp Fri, 02 Jul 2004 18:14:50 -0700

Hi,

Using: nutch-2004-07-01.tar.gz, java version "1.4.2_02"

I have run "bin/nutch crawl ...." successfully.
I would like to access the raw html of the downloaded pages programmatically.

I have looked at net.nutch.db.WebDBReader, and was able to access an
enumeration of Page objects through the method call WebDBReader.pages().
However, the Page object does not contain an html content field.

I have also looked at net.nutch.protocol.Content, since it seemed to be
able to access the html content. However, I ran into an EOFException
when I tried to run the following command:

"bin/nutch net.nutch.protocol.Content 0 $SEG_DIR/20040702121629/"

I received the following EOFException:

Exception in thread "main" java.io.EOFException
        at java.io.RandomAccessFile.readFully(RandomAccessFile.java:365)
        at java.io.RandomAccessFile.readFully(RandomAccessFile.java:343)
        at net.nutch.io.SequenceFile$Reader.init(SequenceFile.java:150)
        at net.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:134)
        at net.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:127)
        at net.nutch.io.MapFile$Reader.<init>(MapFile.java:182)
        at net.nutch.io.MapFile$Reader.<init>(MapFile.java:160)
        at net.nutch.io.ArrayFile$Reader.<init>(ArrayFile.java:35)
        at net.nutch.protocol.Content.main(Content.java:157)
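
For reference, the top two frames are the standard java.io.RandomAccessFile.readFully call, which throws EOFException when the file ends before the requested number of bytes has been read. One possible reading of the trace (an assumption, not something confirmed here) is that the file opened by SequenceFile$Reader is shorter than the header the reader expects. The sketch below reproduces only that JDK behaviour. It does not use any Nutch classes, the class and method names are made up for illustration, and it uses try-with-resources, so it needs a newer JDK than the 1.4.2 mentioned above.

```java
import java.io.EOFException;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical demo class; not part of Nutch.
public class EofDemo {

    /**
     * Writes fileLen zero bytes to a temporary file, then tries to read a
     * headerLen-byte "header" with readFully.
     * Returns true if the read failed with EOFException.
     */
    public static boolean headerReadHitsEof(int fileLen, int headerLen) throws IOException {
        File tmp = File.createTempFile("eof-demo", ".bin");
        try {
            try (RandomAccessFile out = new RandomAccessFile(tmp, "rw")) {
                out.write(new byte[fileLen]);
            }
            try (RandomAccessFile in = new RandomAccessFile(tmp, "r")) {
                // readFully either fills the whole buffer or throws EOFException.
                in.readFully(new byte[headerLen]);
                return false;
            } catch (EOFException e) {
                return true;
            }
        } finally {
            tmp.delete();
        }
    }

    public static void main(String[] args) throws IOException {
        // Empty file: there are not enough bytes for a 4-byte header.
        System.out.println("empty file, 4-byte header: " + headerReadHitsEof(0, 4));
        // 8-byte file: the 4-byte header read succeeds.
        System.out.println("8-byte file, 4-byte header: " + headerReadHitsEof(8, 4));
    }
}
```

If the same pattern applies here, checking whether the file under the segment directory is empty or truncated, and whether the path passed on the command line points at the directory the reader expects, would be a reasonable first step.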

How do I correctly access the raw html content?

Thanks.

Chris

_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers