Hi Melroy,

On Aug 24, 2009, at 12:20pm, melroyr wrote:


I have written a program to download html pages from harristeeter. However,
when I run my program, I get the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">
<html>
<head>
<title>Your Personal Shopping List</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

[snip]

</frameset>
<frame src="actions.jsp" name="bottomFrame" scrolling="YES" noresize>
</frameset>

<noframes><body>
This application requires the use of frames, which your browser does not
support.
</body></noframes>

</html>

The URL I am using to download the pages is
http://flyer.harristeeter.com/HT_eVIC/ThisWeek/ReviewAllSpecials.jsp

Please advise if there is some setting that I need to set in HttpClient. I have read about HtmlCleaner and similar tools, but I do not think they will help.

Well, first it would help to know what you think is the problem. The above page seems OK to me.

If I had to guess, the issue is that you want the content of the frames, i.e. the pages that the <frame src="xxx"> links point to.

If so, then HttpClient can't automagically help you here: it fetches only the URL you give it and does not parse the HTML or follow frames. The easiest approach would be to use a regex to extract the src="xxx" links, convert them from relative to absolute, and fetch each one in turn, similar to what a real web crawler would do.
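
Here is a rough sketch of the extract-and-resolve step, using only java.util.regex and java.net.URI. The class and method names (FrameLinks, extractFrameUrls) are made up for illustration, and the regex only handles quoted src attributes, so treat it as a starting point rather than a robust HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient.
class FrameLinks {
    // Matches src="..." or src='...' inside a <frame> or <iframe> tag.
    // The \b after "frame" keeps <frameset> from matching.
    // Assumption: the src value is quoted; unquoted values are ignored.
    private static final Pattern FRAME_SRC = Pattern.compile(
            "<i?frame\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the absolute URL of every frame found in html,
    // resolved against the URL the page was fetched from.
    static List<String> extractFrameUrls(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        List<String> urls = new ArrayList<>();
        Matcher m = FRAME_SRC.matcher(html);
        while (m.find()) {
            // URI.resolve turns a relative link such as "actions.jsp"
            // into an absolute one; it throws IllegalArgumentException
            // if the src value is not a syntactically valid URI.
            urls.add(base.resolve(m.group(1).trim()).toString());
        }
        return urls;
    }
}
```

You would pass in the URL you originally requested plus the response body as a String, then fetch each returned URL with HttpClient the same way you fetched the first page. If the frames themselves contain framesets, repeat the process on each result.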

-- Ken


