protocol-http or protocol-httpclient can't get all page source

jeffersonzhou Fri, 13 May 2011 10:36:33 -0700

Hi,

I just noticed that neither protocol-http nor protocol-httpclient fetches the
complete page source of a large HTML file. For instance, I want to get the
full page source of this URL (http://www.taobao.com/), but the code below
doesn't do the job:


1) protocol-httpclient
try {
  byte[] buffer = new byte[HttpBase.BUFFER_SIZE];
  //byte[] buffer = new byte[contentLength];
  int bufferFilled = 0;
  int totalRead = 0;
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1
      && totalRead < contentLength) {
    totalRead += bufferFilled;
    out.write(buffer, 0, bufferFilled);
  }

  content = out.toByteArray();
}
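
To make the behaviour easier to reproduce, here is a minimal stand-alone sketch of the same loop shape. It is not Nutch code: the class ReadLoopSketch, the method readCapped, and the 8192-byte BUFFER_SIZE are hypothetical stand-ins, and the sketch assumes that contentLength acts as an upper bound on the number of bytes kept.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-alone class for illustration; not part of Nutch.
class ReadLoopSketch {
  // Assumed buffer size, standing in for HttpBase.BUFFER_SIZE.
  static final int BUFFER_SIZE = 8192;

  // Same loop shape as the excerpt above: stop at end of stream,
  // or once totalRead has reached contentLength.
  static byte[] readCapped(InputStream in, int contentLength) {
    byte[] buffer = new byte[BUFFER_SIZE];
    int bufferFilled;
    int totalRead = 0;
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try {
      while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1
          && totalRead < contentLength) {
        totalRead += bufferFilled;
        out.write(buffer, 0, bufferFilled);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] page = new byte[200_000]; // stand-in for a large HTML page

    // Cap of 64 KiB: the loop exits early and the page is truncated.
    System.out.println(
        readCapped(new ByteArrayInputStream(page), 65_536).length); // prints 65536

    // Effectively no cap: the loop runs until end of stream.
    System.out.println(
        readCapped(new ByteArrayInputStream(page), Integer.MAX_VALUE).length); // prints 200000
  }
}
```

If contentLength in the real plugin ends up bounded by a configured limit such as Nutch's http.content.limit property (an assumption about the cause, not something confirmed here), a large page would be cut off in the same way as the first call in main.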


2) protocol-http

String contentEncoding = getHeader(Response.CONTENT_ENCODING);
if ("gzip".equals(contentEncoding) || "x-gzip".equals(contentEncoding)) {
  content = http.processGzipEncoded(content, url);
} else if ("deflate".equals(contentEncoding)) {
  content = http.processDeflateEncoded(content, url);
} else {
  if (Http.LOG.isTraceEnabled()) {
    Http.LOG.trace("fetched " + content.length + " bytes from " + url);
  }
}


How can I fetch the complete page source in this case?

Thanks
