I was scraping four completely different sites at the same time with the same code, so I took a fairly object-oriented approach (read: polymorphic inheritance) and ran my objects on multiple threads. The basic principle is the same for a single site, though. There may be many ways to do it; here's how I did it, and these are the steps I'd suggest:
1. Understand how the request-response system works for the web and how your browser works internally.
2. Study the HttpWebRequest and HttpWebResponse classes in the .NET Framework and understand how they can be used to mimic any web traffic.
3. Use Fiddler (v2 parses HTTPS!) to analyze the HTTP POST that takes place when you submit data on the target site. That will give you the list of form parameters that are passed to the server with each request. Since this is an ASP.NET site, the ViewState will be one of the required parameters and must be passed back untampered.
4. Create a (preferably configurable) list of the form parameters that are submitted with the request. I used an XML file that the application loads dynamically.
5. Load this list and build a query-string-style body (name=value pairs joined with &) containing all those parameter names and the values they expect. The values here are substituted with the values obtained from the users of your intranet site. You will probably need to URL-encode the ViewState value, since it is Base64 and can contain +, / and = characters.
6. Convert the string to a byte array and write this byte array to the request stream exposed by your HttpWebRequest object (GetRequestStream()).
7. Set as many properties of the HttpWebRequest object as you can, such as ProtocolVersion, Method, ContentType, Timeout, and any other headers you need. This information can usually be found by studying the request that Fiddler captured. Set at least Method and ContentType before you write the body in step 6; GetRequestStream() throws if the method is still GET.
8. Call HttpWebRequest.GetResponse() to obtain the HttpWebResponse returned by the server. If all works well, this will be the next page that you would see when submitting the data manually. You can then analyze this page for any data you want.

Hope that helps!

On Dec 30, 5:24 am, Dixie Normous wrote:
> TechOwl, thanks for the links, let me try to present some more
> specifics about my situation.
>
> The external site is an internet site which my site has been granted
> access to.
> A group of users on site has a shared username and
> password they use to manually browse the site to search for customer
> info, kind of like a corporate B2B account a reseller might have to a
> vendor's site. That site uses ASP .NET 1.1, but I've got no access
> beyond the browser to this site per an agreement between our company
> and theirs.
>
> I'm trying to build a tool which would basically involve me creating a
> "wrapper" site on my corporate intranet to convey search queries to
> the remote site. For example, a site on my intranet might show the
> various search fields available, the user would input their criteria,
> and when they submit this, my backend code (maybe running as a WCF
> service or something on one of the servers on my intranet) would
> package up and execute a browsing session (crawl?) to the remote site,
> and retrieve any results of said query.
>
> Since the remote site doesn't expose any kind of API or anything that
> I could query more directly (or less indirectly?!) it seems automating
> the web browsing process is the next best option, in my limited
> experience so far at this kind of stuff. Hope this helps further
> clarify what I'm talking about; if any further details would help ask
> away, meanwhile I'll definitely check out Fiddler as suggested by
> yourself and Milo.
>
> On Dec 29, 6:53 am, TechOwl wrote:
>
> > Dixie...
>
> > You are being a little too vague and not providing enough information
> > for us to really help you. I believe what you are wanting to do is
> > "doable" but we need more detailed information.
>
> > For example, is the external site a .NET site or something else? Such
> > a distinction is important because if it is a .NET site (for example)
> > then you have to be concerned with the things like viewstate, etc.
>
> > I have done this recently, and am also doing it now for a current
> > project, so if you want to share additional details and get help, I
> > can certainly try to help if you want to e-mail me or something.
>
> > This is not very straight-forward, and it is also not really a
> > recommended way of developing solutions due to the dependencies
> > inherent to the approach.
>
> > For starters...
>
> > 1) Get Fiddler & learn some about it: http://www.fiddlertool.com/fiddler/
> > 2) Look into the System.Net namespace (if you're dealing with .NET):
> > http://msdn.microsoft.com/en-us/library/system.net.aspx
> > [specifically HttpWebRequest, WebResponse, as well as
> > System.IO.StreamReader]
>
> > Hope this helps some!
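
P.S. In case it helps to see steps 4 to 7 of my reply spelled out, here is a rough, language-neutral sketch. It uses Python's standard library rather than HttpWebRequest, purely so it can be run without a .NET project, and it only builds the request without sending it. The XML layout, the field names (txtCustomer, btnSearch) and the URL are all made up for illustration; the real ones are whatever Fiddler shows for the target site. The regex also assumes the hidden field is rendered with name before value, which may not hold for every page.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import Request

# Step 4: a configurable parameter list. This layout and these field names
# are hypothetical; use the names Fiddler shows for the real form.
PARAMS_XML = """
<form>
  <param name="__VIEWSTATE" source="viewstate"/>
  <param name="txtCustomer" source="user"/>
  <param name="btnSearch" value="Search"/>
</form>
"""


def extract_viewstate(html):
    """Return the __VIEWSTATE hidden field from a previously fetched page.

    Assumes the attribute order name=... value=...; returns "" if not found.
    """
    match = re.search(r'name="__VIEWSTATE"[^>]*value="([^"]*)"', html)
    return match.group(1) if match else ""


def build_body(params_xml, user_values, page_html):
    """Steps 5 and 6: fill in the values, URL-encode them, return bytes."""
    fields = []
    for param in ET.fromstring(params_xml).findall("param"):
        name = param.get("name")
        source = param.get("source")
        if source == "viewstate":
            value = extract_viewstate(page_html)   # passed back untouched
        elif source == "user":
            value = user_values.get(name, "")      # typed in by the intranet user
        else:
            value = param.get("value", "")         # fixed value from the config
        fields.append((name, value))
    # urlencode escapes the +, / and = that Base64 ViewState can contain.
    return urlencode(fields).encode("ascii")


def build_request(url, body):
    """Step 7: set the method and headers to match what Fiddler captured."""
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


if __name__ == "__main__":
    page = '<input type="hidden" name="__VIEWSTATE" value="dDwt+/8=" />'
    body = build_body(PARAMS_XML, {"txtCustomer": "Acme & Co"}, page)
    request = build_request("http://example.invalid/search.aspx", body)
    print(request.get_method(), body.decode("ascii"))
```

In .NET the same bytes would go into the stream returned by GetRequestStream(), and step 8 is then the GetResponse() call. Keeping the parameter list outside the code is what let me reuse one scraper for several sites.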
