The following page has been changed by RolandWeber: http://wiki.apache.org/jakarta-httpclient/ForAbsoluteBeginners The comment on the change is: moved ------------------------------------------------------------------------------ - #pragma section-numbers 2 + #deprecated - = Client HTTP Programming Primer = + This page has been [http://wiki.apache.org/HttpComponents/ForAbsoluteBeginners moved] to the new [http://wiki.apache.org/HttpComponents/ HttpComponents Wiki]. - == About == - - This document is intended for people who suddenly have to or want to implement - an application that automates something usually done with a browser, - but are missing the background to understand what they actually need to do. - It provides guidance on the steps required to implement a program that - interacts with a web site which is designed to be used with a browser. - It does not save you from eventually learning the background of what - you are doing, but it should help you to get started quickly and learn - the details later. - [[BR]] - This document has evolved from discussions on the HttpClient mailing lists. - Although it refers to HttpClient, the concepts described here apply equally - to HttpComponents or SUN's [http://java.sun.com/j2se/1.4.2/docs/api/java/net/HttpURLConnection.html HttpURLConnection] or any other - HTTP communication library for any programming language. So you might - find it useful even if you're not using Java and HttpClient. - [[BR]] - The existence of this document does not imply that the HttpClient community - feels responsible for teaching you how to program a client HTTP application. - It is merely a way for us to reduce the noise on the mailing list without - just leaving the newbies out in the cold. - - - ---- - [[TableOfContents]] - ---- - - - == Scenario == - - Let's assume that you have some kind of repetitive, web-based task that - you want to automate. 
Something like: - - * go to page http:''''''//xxx.yyy.zzz/login.html - * enter username and password in a web form and hit the "login" button - * navigate to a specific page - * check the number/headline/whatever shown on that page - - At this time, we don't have a specific example which could be developed - into a sample application. So this document is all bla-bla, and you will - have to work out the details - all the details - yourself. Such is life. - - - === Caveat === - - This scenario describes a hobbyist usage of HTTP, in other words: - '''a bad practice'''. Web sites are designed for user interaction, not - as an application programming interface (API). The interface of a - web site is the user interface displayed by a browser. The HTTP - communication between the browser and the server is an internal API, - subject to change without notice. - [[BR]] - A web site can be redesigned at any point in time. The server then - sends different documents and a browser will display the new content. - The user easily adjusts and clicks the appropriate links, and the browser - communicates via HTTP as specified by the new documents from the server. - Your application, which only mimics a browser, will simply break. - [[BR]] - Nevertheless, implementing this scenario will help you to get - familiar with HTTP communication. It is also "good enough" for - hobbyist applications, for example if you want to download the - latest installment of your favorite daily webcomic to install - it as the screen background. No great harm is done if such an - application breaks. - - If you want to implement a solid application, you should use only - published APIs. For example, to check for new mail on your webmail - account, you should ask the webmail provider for POP or IMAP access. - These are standardized protocols supported by most email client applications. - If you want to have a newsticker, look for RSS feeds from the provider and - applications that display them. 
- [[BR]] - As another example, if you want to perform a web search, there are - search companies that provide an API for using their search engines. - Unlike the examples before, such APIs are proprietary. You will still - have to implement an application, but then you are using a published API - that the provider will not change without notice. - - - - == Not a Browser == - - HttpClient is not a browser. Here's the difference. - - - === Browser === - - attachment:browser.png - - The figure shows some of the components you will find in a browser. - To the left, there is the user interface. The browser needs a rendering - engine to display pages, and to interpret user input such as mouse clicks - somewhere on the displayed page. There is a layout engine which computes - how an HTML page should be displayed, including cascading style sheets - and images. A Java''''''Script interpreter runs Java''''''Script code embedded in - or referenced from HTML pages. Events from the user interface are passed - to the Java''''''Script interpreter for processing. - On the top, there are interfaces for plugins that can handle Applets, - embedded media objects like PDF files, Quicktime movies and Flash animations, - or ActiveX controls that can do anything. - - In the center of the figure you can find internal components. Browsers - have a cache of recently accessed documents and image files. They need - to remember cookies and passwords entered by the user. Such information - can be kept in memory or stored persistently in the file system at the - bottom of the figure, to be available again when the browser is restarted. - Certificates for secure communication are almost always stored persistently. - To the right of the figure is the network. Browsers support many protocols - on different levels of abstraction. 
There are application protocols - such as FTP and HTTP to retrieve documents from servers, and transport - layer protocols such as TLS/SSL and Socks to establish connections for - the application protocols. - - One characteristic of browsers that is not shown in the figure is tolerance - for bad input. There needs to be tolerance for invalid user input to make - the browser user friendly. There also needs to be tolerance for malformed - documents retrieved from servers, and for flaws in server behavior when - executing protocols, to make as many websites as possible accessible to - the user. - - - === HTTP Client === - - attachment:httpclient.png - - The figure shows some of the components you will find in a browser, - and highlights the scope of HttpClient. The primary responsibility - of HttpClient is the HTTP protocol, executed directly or through an - HTTP proxy. It provides interfaces and default implementations for - cookie and password management, but not for persisting such data. - User interfacing, HTML parsing, plugins or non-HTTP application level - protocols are not in the scope of HttpClient. It does provide interfaces - to plug in transport layer protocols, but it does not implement such - protocols. - - All the rest of a browser's functionality you require needs to be - provided by your application. HttpClient executes HTTP requests, but it - will not and can not assemble them. Since HttpClient does not interface - with the user, nor interpret content such as HTML files, there is - little or no tolerance for bad data passed to the API. There is some - tolerance for flaws in server behavior, but there are limits to the - deviations HttpClient can handle. - - - - == Terminology == - - This section introduces some important terms you have to know to - understand the rest of this document. - - HTTP Message:: - consists of a header section and an optional entity. There are two kinds - of messages, requests and responses. 
They differ in the format of the - first line, but both can have header fields and an optional entity. - - HTTP Request:: - is sent from a client to a server. The first line includes the URI - for which the request is sent, and a method that the server should - execute for the client. - - HTTP Response:: - is sent from a server to a client in response to a request. The first - line includes a status code that tells about success or failure of - the request. HTTP defines a set of status codes, like 200 for success - and 404 for not found. Other protocols based on HTTP can define - additional status codes. - - Method:: - is an operation requested from the server. HTTP defines a set of - operations, the most frequent being GET and POST. Other protocols - based on HTTP can define additional methods. - - Header Fields:: - are name-value pairs, where both name and value are text. The name of - a header field is not case sensitive. Multiple values can be assigned - to the same name. RFC 2616 defines a wide range - of header fields for handling various aspects of the HTTP protocol. - Other specifications, like RFC 2617 and RFC 2965, define additional - headers. Some of the defined headers are for general use, others are - meant for exclusive use with either requests or responses, still others - are meant for use only with an entity. - - Entity:: - is data sent with an HTTP message. For example, a response can contain - the page or image you are downloading as an entity, or a request can - include the parameters that you entered into a web form. - The entity of an HTTP message can have an arbitrary data format, which - is usually specified as a MIME type in a header field. - - Session:: - is a series of requests from a single source to a server. The server - can keep session data, and needs to recognize the session to which - each incoming request belongs. For example, if you execute a web search, - the server will only return one page of search results. 
But it keeps - track of the other results and makes them available when you click on - the link to the "next" page. The server needs to know from the request - that it is you and your session for which more results are requested, - and not me and my session. That's because I searched for something else. - - Cookies:: - are the preferred way for servers to track sessions. The server supplies - a piece of data, called a cookie, in response to a request. The server - expects the client to send that piece of data in a header field with each - following request of the same session. - The cookie is different for each session, so the server can identify to - which session a request belongs by looking at the cookie. If the cookie - is missing from a request, the server will not respond as expected. - - - - == Step by Step == - - - === GET the Login Page === - - - Create and execute a GET request for the login page. - Just use the link you would type into the browser as the URL. - This is what a browser does when you enter a URL in the address bar - or when you click on a link that points to another web page. - - Inspect the response from the server: - - * do you get the page you expected? - - It should be sent as the entity of the response to your request. - The entity is also referred to as the response body. - - * do you get a session cookie? - - Cookies are sent in a header field named Set-Cookie or Set-Cookie2. - It is possible that you don't get a session cookie until you log in. - If there is no session cookie in the response, you'll have to perform - step 2 later, after you reach the point where the cookie is set. - - If you do not get the page you expect, check the URL you are requesting. - If it is correct, the server may use browser detection. You will have - to set the header field User-Agent to a value used by a popular browser - to pretend that the request is coming from that browser. - - If you can't get the login page, get the home page instead now. 
- Get the login page in the next step, when you establish the session. - - - === Establish the Session === - - Create and execute another GET request for a page. - You can simply request the login page again, or some other page - of which you know the URL. Do NOT try to get a page which would - be returned in response to submitting a web form. Use something - you can reach simply by clicking on a link in the browser. Something - where you can see the URL in the browser status line while the - mouse pointer is hovering over the link. - [[BR]] - This step is important when developing the application. Once you know - that your application does establish the session correctly, you may - be able to remove it. You know you have to leave it in only if you - couldn't get the login page directly and had to get the home page first. - - Inspect the request being sent to the server. - - * is the session cookie sent with the request? - - You can see what is sent to the server by enabling the - [http://jakarta.apache.org/commons/httpclient/logging.html wire log] - for HttpClient. You only need to see the request headers, not the body. - The session cookie should be sent in a header field called Cookie. - There may be several of those, and other cookies might be sent as well. - - Inspect the response from the server: - - * do you get another session cookie? - - You should not get another session cookie. If you get the same session - cookie as before, the server behaves a little strangely but that should - not be a problem. If you get a new session cookie, then the server did - not recognize the session for the request. Usually, this happens if the - request did not contain the session cookie. But servers might use other - means to track sessions, or to detect session hijacking. - - - If the session cookie is not sent in the request, one of two things - has gone wrong. 
Either the cookie was not detected in the previous - response, or the cookie was not selected for being sent with the new - request. - [[BR]] - HttpClient automatically parses cookies sent in responses and puts them - in an object called {{{HttpState}}}. HttpClient uses a configurable cookie policy - to decide whether a cookie being sent from a server is correct. - The default policy complies strictly with RFC 2109, but many servers - do not. Play around with the cookie policies until the cookie is - accepted and put into the {{{HttpState}}}. - [[BR]] - If the cookie is accepted from the previous response but still not - sent with the new request, make sure that HttpClient uses the same - {{{HttpState}}} object. Unless you explicitly manage {{{HttpState}}} objects - (not recommended for newbies!), this will be the case if you use - the same HttpClient object to execute both requests. - [[BR]] - If the cookie is still not sent with the request, make sure that the - URL you are requesting is in the scope for the cookie. Cookies are - only sent to the domain and path specified in the cookie scope. - A cookie for host "jakarta.apache.org" will not be sent to host - "tomcat.apache.org". A cookie for domain ".apache.org" will be sent - to both. A cookie for host "apache.org", without the leading dot, - will not be sent to "jakarta.apache.org". The latter case can be - resolved by using a different cookie spec that adds the leading dot. - In the other cases, use a URL that is in the cookie scope to establish - the session. - - If the session cookie is sent with the request, but a new session cookie - is set in the response anyway, check whether there are cookies other - than the session cookie in the request. Some servers are incapable of - detecting multiple cookies sent in individual header fields. HttpClient - can be advised to put all cookies into a single header field. - [[BR]] - If that doesn't help, you are in trouble. 
The server may use additional - means to track the session, for example the header field named Referer. - Set that field to the URL of the previous request. - ([http://mail-archives.apache.org/mod_mbox/jakarta-httpclient-user/200602.mbox/[EMAIL PROTECTED] see this mail]) - [[BR]] - If that doesn't help either, you will have to compare the request from - your application to a corresponding one generated by a browser. The - instructions in step 5 for POST requests apply for GET requests as well. - It's even simpler with GET, since you don't have an entity. - - - === Analyze the Form === - - Now it is time to analyze the form defined in the HTML markup of the page. - A form in HTML is a set of name-value-pairs called parameters, where some - of the values can be entered in the browser. By analyzing the HTML markup, - you can learn which parameters you have to define and how to send them - to the server. - [[BR]] - Look for the <form> tag in the page source. There may be several forms in - the page, but they can not be nested. Locate the form you want to submit. - Locate the matching </form> tag. Everything in between the two may be - relevant. Let's start with the attributes of the <form> tag: - - method=:: - specifies the method used for submitting the form. If it is GET or - not specified at all, then you need to create a GET request. The parameters - will be added as a query string to the URL. If the method is POST, you - need to create a POST request. The parameters will be put in the entity - of the request, also referred to as the request body. - How to do that is discussed in step 5. - - action=:: - specifies the URL to which the request has to be sent. Do not try to - get this URL from the address bar of your browser! A browser will - automatically follow redirects and only displays the final URL, which - can be different from the URL in this attribute. - It is possible that the URL includes a query string that specifies - some parameters. 
If so, keep that in mind. - - enctype=:: - specifies the MIME type for the entity of the request generated by the - form. The two common cases are url-encoded (default) and multipart-mime. - Note that these terms are just informally used here, the exact values - that need to be written in an HTML document are specified elsewhere. - This attribute is only used for the POST method. If the method is GET, - the parameters will always be url-encoded, but not in an entity. - - accept-charset=:: - specifies the character set that the browser should allow for user input. - It will not be discussed here, but you will have to consider this value - if you experience charset related problems. - - - Except for optional query parameters in the action attribute, the parameters - of a form are specified by HTML tags between <form> and </form>. - The following is a list of tags that can be used to define parameters. - Except where stated otherwise, they have a name attribute which specifies - the name of the parameter. The value of the parameter usually depends on - user input. - - {{{<input type="text" name="...">}}} - [[BR]] - {{{<input type="password" name="...">}}} - specify single-line input fields. Using the return key in one of these - fields will submit the form, so the value really is a single line of - input from the user. - - [[BR]] - {{{<input type="text" readonly name="..." value="...">}}} - [[BR]] - {{{<input type="hidden" name="..." value="...">}}} - specify a parameter that can not be changed by the user. - The value of the parameter is given by the value attribute. - - [[BR]] - {{{<input type="radio" name="..." value="...">}}} - [[BR]] - {{{<input type="checkbox" name="..." value="...">}}} - specify a parameter that can be included or omitted. There usually is - more than one tag with the same name. For radio buttons, only one can - be selected and the value of the parameter is the value of the selected - radio button. For checkboxes, more than one can be selected. 
There will - be one name-value-pair for each selected checkbox, with the same name - for all of them. - - [[BR]] - {{{<input type="submit" name="..." value="...">}}} - [[BR]] - {{{<button type="submit" name="..." value="...">}}} - [[BR]] - specify a button to submit the form. The parameter will only be added - to the form if that button is used to submit. If another button is used, - or the form is submitted by pressing the return key in a text input field, - the parameter is not part of the submitted form data. If the name attribute - is missing, no parameter is added to the form data for that button. - - [[BR]] - {{{<textarea name="...">}}} - [[BR]] - {{{<textarea value="..." readonly>}}} - [[BR]] - specify a multi-line input field. In the readonly case, the value of - the parameter is the text between the <textarea> and </textarea> tags. - - [[BR]] - {{{<select name="..." multiple>}}} - [[BR]] - {{{ <option value="...">...</option>}}} - [[BR]] - {{{ <option value="...">...</option>}}} - [[BR]] - {{{ ...}}} - [[BR]] - {{{</select>}}} - [[BR]] - specify a selection list or drop-down menu. If the multiple attribute is - not present, only one option can be selected. There will be one - name-value-pair for each selected option, with the same name for all of them. - If there is no value attribute, the value for that option is - the text between <option> and </option>. - - [[BR]] - {{{<input type="image" name="...">}}} - [[BR]] - specifies an image that can be clicked to submit the form. If that image - is clicked to submit the form, two parameters are added to the form data. - Their names are the name attribute suffixed with ".x" and ".y"; their - values are the relative coordinates of the mouse pointer within the - image at the time of the click, in pixels. If the name attribute is missing, - no parameters will be added to the form data. - - [[BR]] - {{{<input type="file" name="...">}}} - [[BR]] - specifies a file selection box. 
The user can select a file that should - be sent as part of the form data. This is only possible if the encoding - is multipart-mime. Unlike other parameters, the file is not mapped to a - simple name-value-pair. File upload is not a topic for beginners. - - These tags are used to define parameters in static HTML. With dynamic HTML, - in particular Java''''''Script, the parameter values can be changed before the - form is submitted. If that is the case, you are in trouble. Learn Java''''''Script, - analyze the code that is executed, and modify your application to match - that behavior. - - - === Analyze the Form, Again === - - After you have determined the action URL and name-value-pairs of - a form, you should exit the program you used to get the HTML source, - start it again and repeat the analysis with the new page. - - Most parameters will be the same for both pages. But some parameters, - in particular those from hidden input fields, may change from session - to session, or even with every request. The same can be the case with - the action URL. - [[BR]] - Parameters that remain the same can be hard-coded in your program. - If parameters change (except for user input), then your application - has to request the page with the form and extract the dynamic parameters - at runtime. If you're lucky you can locate them by simple string searches. - If you're unlucky, you need an HTML parser to make sense of the page. - HTML parsing is out of scope for HttpClient, but you'll find some - HTML parsers mentioned in the mailing list archives. - - Note that a redesign of the form on the server can break your application - at any time. Whenever that happens, you have to repeat the analysis with - the new form returned by the server after the redesign, and adjust your - application accordingly. - - - === POST the Form === - - After analyzing the form, it is time to create a request that matches - what a browser would generate. 
If the method is GET, just add the - name-value-pairs for all parameters to the query string. If the method - is POST, things are a little more complicated. - [[BR]] - It depends on the server how closely you have to match browser behavior. - For example, a servlet will not distinguish between parameters in the - query string and url-encoded parameters of the entity. But other server - side code might make that distinction. The safe way is always to match - browser behavior exactly. - [[BR]] - HttpClient supports both encoding types, url-encoded and multipart-mime. - To send parameters url-encoded, use the {{{PostMethod}}} and add the parameters - directly there. To send parameters in multipart-mime, collect the parameters - in a {{{MultipartRequestEntity}}} and set it as the entity of the {{{PostMethod}}}. - You will also find support for file upload in the multipart package. - Note that these techniques are mutually exclusive; they can not be combined. - Parameters defined in the query string of the URL can remain there. - - Send the request. Inspect the response from the server: - - * do you get a status code 303 or 307? - - That is called a redirect. Follow redirects to the ultimate page - and inspect that response. See step 6 on following redirects. - - * do you get the page you expected? - - If the server response to your POST request indicates a problem, - try to enable or disable the expect-continue handshake, or switch - the protocol version to HTTP/1.0. If that doesn't help... - - Inspect the request you are sending: - - * are there significant differences to the request of a browser? - - There is a variety of sniffer programs you can use to capture the - browser request. Some of them are mentioned in the responses - to [http://mail-archives.apache.org/mod_mbox/jakarta-httpclient-user/200603.mbox/[EMAIL PROTECTED] this question] on the mailing list. - - Candidates for problems are missing or wrong parameters, and differences - in the header fields. 
The parameters are all up to you. As a general rule - for the header fields, you should send the same as the browser does. The - order of the fields does not matter. - [[BR]] - But there's a caveat: some header fields are controlled by HttpClient and - can not be set explicitly. Other header fields are used to indicate - capabilities which a browser has, but your application probably does not. - For these, the request from your application has to and should differ. - Here is a possibly incomplete list of headers that need special consideration: - - {{{Host:}}} - [[BR]] - controlled by HttpClient. The value is usually obtained from the URL - you are posting to. It is possible to set a different value, called - a "virtual host". - - [[BR]] - {{{Content-Type:}}} - [[BR]] - {{{Content-Length:}}} - [[BR]] - {{{Transfer-Encoding:}}} - [[BR]] - controlled by HttpClient. The values are obtained from the request entity. - - [[BR]] - {{{Connection:}}} - [[BR]] - usually controlled by HttpClient to handle connection keep-alive. - Leave it alone or set the value to "close". - - [[BR]] - {{{Accept-Encoding:}}} - [[BR]] - used to indicate the capability to process compressed responses. - Do not set this, unless you are prepared to implement decompression. - - - === Follow Redirects === - - It is quite common for servers to respond with a 303 or 307 status code - to a POST request. These redirects indicate that your application has to - send another request to retrieve the actual result of the operation you - have triggered with the POST request. - [[BR]] - HttpClient can be configured to follow some redirects automatically. - Others it is not allowed to follow automatically, since RFC 2616 specifies - that a user interaction should take place. We will make sure that HttpClient - is compliant with this requirement, but we can't stop you from implementing - a different behavior in your application. 
The Location header field in the - redirect response indicates the URL from which to fetch the actual page. - It is common practice that servers return a relative URL as the location, - although the specification requires an absolute URL. - [[BR]] - Note that there may be more than one redirect in succession. Your - application then has to follow the redirect for a redirect, but make sure - that you do not enter an infinite loop. If you find that there are more - than two redirects in succession, something probably is fishy. - - - === Logout === - - Your application can send as many GET and POST requests and follow as many - redirects as is required. But you should remember that there is a session - tracked by the server. Once your application is done, and if the web site - does provide a logout link, you should send a final request to log out. - This will tell the server that the session data can be discarded. If the - server prevents multiple logins with the same user ID and your application - has to run repeatedly, logout may even be required. - - - - == Further Reading == - - - ReferenceMaterials: a list of technical specifications for HTTP and related stuff. - - - [http://www.w3.org/TR/html4/interact/forms.html HTML 4.01 Specification], Section on Forms: - Includes how browsers have to generate the data to submit to the server. - - - [http://www.webreference.com/html/tutorial13/ Giving Form to Forms]: - Explains how to define HTML forms and what is submitted to the server. - Probably easier to digest than the HTML 4.01 Specification. - - - [http://java.sun.com/developer/technicalArticles/InnerWorkings/BackstageSession/index.html JDC and Session Management]: - Details of a real site using session tracking, login forms and redirects. - - - [http://jakarta.apache.org/commons/fileupload/ Commons File Upload]: - Server-side library for parsing multipart requests. 
- - - [http://www.cs.tut.fi/~jkorpela/forms/file.html Tutorial on File Upload in HTML] - - - [http://jakarta.apache.org/commons/httpclient/userguide.html HttpClient User Guide] -
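Three of the small mechanics described in the step-by-step section can be sketched with the JDK alone: url-encoding the name-value-pairs of a form, resolving a relative Location header against the URL of the request, and checking whether a host falls into a cookie's domain scope. The class {{{FormSketch}}} and its method names are hypothetical and not part of HttpClient. The domain check is a deliberately simplified reading of the scoping rule quoted above, not a full RFC 2109 implementation, and UTF-8 is an assumed charset for the form data.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of HttpClient. */
public class FormSketch {

    /** URL-encodes one name or value. UTF-8 is an assumption made for this sketch. */
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    /** Joins name-value-pairs into a url-encoded string, in insertion order. */
    public static String encodeForm(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    /** Resolves a Location value, which servers often send as a relative URL. */
    public static String resolveLocation(String requestUrl, String location) {
        return URI.create(requestUrl).resolve(location).toString();
    }

    /**
     * Simplified domain scope check: a cookie domain with a leading dot
     * matches every host ending in it, anything else must match exactly.
     */
    public static boolean inDomainScope(String host, String cookieDomain) {
        String h = host.toLowerCase(Locale.ROOT);
        String d = cookieDomain.toLowerCase(Locale.ROOT);
        return d.startsWith(".") ? h.endsWith(d) : h.equals(d);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("username", "alice");
        params.put("password", "p@ss word");
        // prints username=alice&password=p%40ss+word
        System.out.println(encodeForm(params));
        // prints http://xxx.yyy.zzz/app/welcome.html
        System.out.println(resolveLocation("http://xxx.yyy.zzz/app/login.html", "welcome.html"));
        // prints true, then false
        System.out.println(inDomainScope("jakarta.apache.org", ".apache.org"));
        System.out.println(inDomainScope("jakarta.apache.org", "apache.org"));
    }
}
```

For a GET form, the string returned by {{{encodeForm}}} would be appended to the action URL after a "?". For a url-encoded POST form, the same string is what ends up in the request entity. When using HttpClient itself, the library performs this encoding for parameters added to a {{{PostMethod}}}, so the sketch is mainly useful for comparing against a captured browser request.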