tml_list[-1] if charset_from_html_list
else ''
return charset_from_html if charset_from_html else charset_from_header
> Date: Sun, 9 Jun 2013 04:47:02 -0700
> Subject: Re: how to detect the character encoding in a web page ?
> From: redstone-c...@163.com
> To: python-l
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
Here is one thread that helped me understand my code:
http://stackoverflow.com/questions/17001407/how-to-detect-the-character-encoding-of-a-we
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
Finally, I found that by using PyQt's QTextStream, QTextCodec and chardet, we can
determine a web page's encoding more reliably,
even for this bad page:
http://ww
On Thu, Jun 6, 2013 at 4:22 PM, Nobody wrote:
> On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:
>
>> The HTTP header is completely out of band. This is the best way to
>> transmit encoding information. Otherwise, you assume 7-bit ASCII and start
>> parsing. Once you find a meta tag, you
On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote:
> The HTTP header is completely out of band. This is the best way to
> transmit encoding information. Otherwise, you assume 7-bit ASCII and start
> parsing. Once you find a meta tag, you stop parsing and go back to the
> top, decoding in th
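A minimal sketch of the out-of-band route described above, assuming Python 3 and only the standard library. The helper name charset_from_content_type is hypothetical; it reads the charset parameter from a Content-Type header value and returns None when the server did not send one.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    # Message is used here only as a parser for the header parameters.
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

With urllib.request the response headers are already a Message subclass, so response.headers.get_content_charset() should give the same answer without the helper.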
On Thu, Jun 6, 2013 at 1:14 AM, iMath wrote:
> On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
>> how to detect the character encoding in a web page ?
>>
>> such as this page
>>
>>
>>
>> http://python.org/
>
> By the way, we cannot get the character encoding programmatically from the meta
> data without k
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
By the way, we cannot get the character encoding programmatically from the meta
data without knowing the character encoding ahead of time!
--
http://mail
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
I found that PyQt's QTextStream can detect the character encoding of a web page
very accurately,
even for this bad page:
http://www.qnwz.cn/html/yin
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
I found that PyQt's QTextStream can detect the character encoding of a web page
very accurately,
even for this bad page:
chardet and beautiful soup
In article ,
Roy Smith wrote:
>In article ,
> Alister wrote:
>
>> Indeed due to the poor quality of most websites it is not possible to be
>> 100% accurate for all sites.
>>
>> personally I would start by checking the doc type & then the meta data as
>> these should be quick & correct, I then us
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
Up to now, chardet may be the only way to have Python do it automatically.
--
http://mail.python.org/mailman/listinfo/python-list
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
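first install chardet
import urllib2  # Python 2; on Python 3 use urllib.request instead
import chardet
# fetch the page's HTML (line holds the URL)
html_1 = urllib2.urlopen(line, timeout=120).read()
#print html_1
mychar = chardet.detect(html_1)
#pri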
In article ,
Alister wrote:
> Indeed due to the poor quality of most websites it is not possible to be
> 100% accurate for all sites.
>
> personally I would start by checking the doc type & then the meta data as
> these should be quick & correct, I then use chardetect only if these
> fail t
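The ordering suggested above (declared charset first, statistical guess only as a fallback) could look roughly like this in Python 3. This is a sketch under assumptions: choose_encoding and its parameters are hypothetical names, and the detector is passed in as a callable so the example itself needs no third-party package.

```python
def choose_encoding(raw, declared=None, detector=None):
    """Pick an encoding for raw bytes: declared charset first, then a detector."""
    # 1. Trust the declared charset only if Python knows the codec
    #    and the bytes actually decode with it.
    if declared:
        try:
            raw.decode(declared)
            return declared.lower()
        except (LookupError, UnicodeDecodeError):
            pass  # unknown codec name or wrong declaration; fall through
    # 2. Otherwise ask the fallback detector, if one was supplied.
    if detector is not None:
        guess = detector(raw)
        if guess:
            return guess.lower()
    # 3. Nothing usable: let the caller decide what to do.
    return None


print(choose_encoding("é".encode("utf-8"), declared="UTF-8"))
```

With chardet installed, the fallback could be supplied as detector=lambda b: chardet.detect(b)["encoding"], since chardet.detect returns a dict with "encoding" and "confidence" keys.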
On Mon, 24 Dec 2012 13:50:39 +, Steven D'Aprano wrote:
> On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:
>
>> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
>> wrote:
>>> $ wget -q -O - http://python.org/ | chardetect.py
>>> stdin: ISO-8859-2 with confidence 0.803579722043
>>> $
>>
>> And it
On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:
> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
> wrote:
>> $ wget -q -O - http://python.org/ | chardetect.py
>> stdin: ISO-8859-2 with confidence 0.803579722043
>> $
>
> And it sucks, because it uses magic, and not reading the HTML tags. The
> RI
On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
wrote:
> $ wget -q -O - http://python.org/ | chardetect.py
> stdin: ISO-8859-2 with confidence 0.803579722043
> $
And it sucks, because it uses magic, and not reading the HTML tags.
The RIGHT thing to do for websites is detect the meta charset
definit
On 24.12.2012 at 04:03, iMath wrote:
> but how to let python do it for you ?
> such as these 2 pages
> http://python.org/
> http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx
> how to detect the character encoding in these 2 pages by python ?
If you have the html code, let
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
But how do you let Python do it for you?
For example, these 2 pages:
http://python.org/
http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).a
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
>
>
> http://python.org/
But how do you let Python do it for you?
For example, this page:
http://python.org/
How can I detect the character encoding of this web page with Python?
On 24/12/12 01:34:47, iMath wrote:
> how to detect the character encoding in a web page ?
That depends on the site: different sites indicate
their encoding differently.
> such as this page: http://python.org/
If you download that page and look at the HTML code, you'll find a line:
So it's
On Mon, Dec 24, 2012 at 11:34 AM, iMath wrote:
> how to detect the character encoding in a web page ?
> such as this page
>
> http://python.org/
You read part-way into the page, where you find this:
That tells you that the character set is UTF-8.
ChrisA
--
http://mail.python.org/mailman/lis
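Reading part-way into the page can be sketched in Python 3 as below. This is only an illustration: sniff_meta_charset and the 1024-byte limit are assumptions of the sketch (HTML5 asks authors to put the declaration within the first 1024 bytes), and it relies on the page being in an ASCII-compatible encoding so the tag can be found before the real encoding is known. It will not work for UTF-16 pages.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw, limit=1024):
    """Return the charset named in a <meta> tag near the top of raw, or None."""
    # Search the undecoded bytes, so no encoding has to be known up front.
    match = META_CHARSET_RE.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None


page = b'<html><head><meta charset="UTF-8"><title>x</title></head></html>'
print(sniff_meta_charset(page))
```

Once a charset is found, the whole byte string can be decoded with it from the top.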
22 matches