Re: [twsocket] webpage image source

RTT Sat, 02 May 2009 13:32:47 -0700

Just read sections 3 and 4 of RFC 1808:
http://www.faqs.org/rfcs/rfc1808.html


Section 3 explains how to establish the base URL of the page you are
parsing. The base URL is needed later to resolve relative URLs.
The method for resolving relative URLs is explained in section 4.
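
To illustrate, here is a minimal sketch of that resolution in Python rather
than Delphi/ICS. It assumes the standard library's urllib.parse.urljoin, which
follows RFC 3986 (the successor to RFC 1808). The helper names
resolve_image_url and site_root and the base_href parameter are made up for
this example; the URLs are the ones from Francois's message quoted below.

```python
from urllib.parse import urljoin, urlsplit


def resolve_image_url(page_url, img_src, base_href=None):
    """Resolve an IMG SRC value against the page it was found in.

    base_href is hypothetical: the value of a <BASE HREF=...> tag,
    if the page declares one. Otherwise the page URL is the base.
    """
    base = urljoin(page_url, base_href) if base_href else page_url
    # urljoin handles all cases: full URLs are returned unchanged,
    # "/path" replaces the whole path, anything else is resolved
    # relative to the directory of the base document.
    return urljoin(base, img_src)


def site_root(page_url):
    """Return scheme and host only, e.g. 'http://www.mysite.com'."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}"


page = "http://www.mysite.com/docs/page.html"
print(resolve_image_url(page, "/photo.jpg"))
print(resolve_image_url(page, "test.com/photo.jpg"))
print(resolve_image_url(page, "http://test.com/photo.jpg"))
print(site_root(page))
```

Note that the root is always the scheme plus host, never a subdirectory,
unless the page itself declares a different base URL.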
> I thought that the root dir could be something definable, and therefore
> may be different than the domain name?
> Like the root dir for
> "www.geocities.com/Athens/111/delphi/docs/sockets.html" would be
> "www.geocities.com/Athens/111/delphi"? But you say it would be
> "www.geocities.com", correct?
>
> Thanks for the replies - this is helpful.
>
>
> <-----Original Message-----> 
>   
>> From: Francois PIETTE [francois.pie...@skynet.be]
>> Sent: 5/2/2009 2:21:07 AM
>> To: twsocket@elists.org
>> Subject: Re: [twsocket] webpage image source
>>
>>     
>>> Hi, I'm using httpcli to save a webpage html doc and I extract all of
>>> its image locations to a text file by saving the '<IMG SRC=' tags.
>>> Afterward I want to download all of the images, but how can I determine
>>> the TRUE location of the images? For example, say the image tag is:
>>> '<IMG SRC='test.com/photo.jpg'' - for all I know, "test.com" could just
>>> be a directory on the server or it could be the website. Another
>>> example, say the image tag is: '<IMG SRC='/photo.jpg'' - so the image is
>>> in the root directory of the website, but who knows what the root
>>> directory is? It may simply be 'test.com', or if the html doc is located
>>> in a subdirectory, it may be something like 'test.com/users/me'.
>>>
>>> So, what is the appropriate way to determine the actual true location of
>>> these images from the 'IMG' tags?
>>>
>> If the image URL starts with "/" then it is an absolute path. Just prepend
>> the website URL and you have the image URL.
>> If the image URL doesn't start with "/", then it is a relative URL. You
>> must prepend the URL of the page where you found the image, excluding
>> the document itself.
>>
>> Example: Assuming you are getting a page from
>> "http://www.mysite.com/docs/page.html".
>> If you find an image source URL as "/photo.jpg" then the complete URL is
>> "http://www.mysite.com/photo.jpg".
>> If you find an image with URL "test.com/photo.jpg" then the complete URL is
>> "http://www.mysite.com/docs/test.com/photo.jpg".
>>
>>
>>     
>>> but who knows what the root directory is?
>>>       
>> The root directory is always easy to find. It is the URL starting from
>> "http:" up to, but not including, the first "/" after the host name. In
>> my example above, the root is simply "http://www.mysite.com".
>>
>> --
>> francois.pie...@overbyte.be
>> The author of the freeware multi-tier middleware MidWare
>> The author of the freeware Internet Component Suite (ICS)
>> http://www.overbyte.be
>>
>> -- 
>> To unsubscribe or change your settings for TWSocket mailing list
>> please goto http://lists.elists.org/cgi-bin/mailman/listinfo/twsocket
>> Visit our website at http://www.overbyte.be

