RE: how to detect the character encoding in a web page ?

2013-06-09 Thread Carlos Nepomuceno
tml_list[-1] if charset_from_html_list else '' return charset_from_html if charset_from_html else charset_from_header > Date: Sun, 9 Jun 2013 04:47:02 -0700 > Subject: Re: how to detect the character encoding in a web page ? > From: redstone-c...@163.com > To: python-l
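
The fragment above prefers a charset found in the HTML and falls back to the one from the HTTP header. A minimal sketch of that fallback, assuming the header value is available as a plain string; the function names are illustrative and not taken from the original message:

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type header value, or ''."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset in lower case, or None.
    return msg.get_content_charset() or ""


def pick_charset(charset_from_html, charset_from_header):
    # Same precedence as the fragment above: the HTML value wins when present.
    return charset_from_html if charset_from_html else charset_from_header
```

For example, pick_charset("", charset_from_content_type("text/html; charset=UTF-8")) falls back to the header value.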

Re: how to detect the character encoding in a web page ?

2013-06-09 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ Here is one thread that helped me understand my code: http://stackoverflow.com/questions/17001407/how-to-detect-the-character-encoding-of-a-we

Re: how to detect the character encoding in a web page ?

2013-06-09 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ Finally, I found that by using PyQt’s QTextStream, QTextCodec and chardet, we can get a web page's code more reliably, even for this bad page http://ww

Re: how to detect the character encoding in a web page ?

2013-06-06 Thread Chris Angelico
On Thu, Jun 6, 2013 at 4:22 PM, Nobody wrote: > On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote: > >> The HTTP header is completely out of band. This is the best way to >> transmit encoding information. Otherwise, you assume 7-bit ASCII and start >> parsing. Once you find a meta tag, you

Re: how to detect the character encoding in a web page ?

2013-06-05 Thread Nobody
On Thu, 06 Jun 2013 03:55:11 +1000, Chris Angelico wrote: > The HTTP header is completely out of band. This is the best way to > transmit encoding information. Otherwise, you assume 7-bit ASCII and start > parsing. Once you find a meta tag, you stop parsing and go back to the > top, decoding in th
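
The procedure quoted above (trust the HTTP header if there is one, otherwise scan the start of the page as ASCII for a meta tag, then decode again from the top) could be sketched roughly as follows. The 1024-byte scan window, the regular expression and the function name are assumptions made for illustration, not something stated in the thread:

```python
import re

# Matches charset=... inside a <meta ...> tag, quoted or unquoted.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def decode_page(raw, header_charset=None, prescan_bytes=1024):
    """Decode raw page bytes using the header charset, else a meta tag."""
    if header_charset:
        # Out-of-band information from the HTTP header comes first.
        return raw.decode(header_charset, errors="replace")
    # Otherwise look at the first bytes only; the tag itself is ASCII.
    match = META_CHARSET_RE.search(raw[:prescan_bytes])
    if match:
        declared = match.group(1).decode("ascii")
        try:
            # Go back to the top and decode everything with that charset.
            return raw.decode(declared, errors="replace")
        except LookupError:
            pass  # the page declared a codec Python does not know
    # Nothing usable was declared: fall back to 7-bit ASCII.
    return raw.decode("ascii", errors="replace")
```

This only works for ASCII-compatible encodings, since the meta tag has to be readable before the real encoding is known.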

Re: how to detect the character encoding in a web page ?

2013-06-05 Thread Chris Angelico
On Thu, Jun 6, 2013 at 1:14 AM, iMath wrote: > On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: >> how to detect the character encoding in a web page ? >> >> such as this page >> >> >> >> http://python.org/ > > by the way ,we cannot get character encoding programmatically from the meta > data without k

Re: how to detect the character encoding in a web page ?

2013-06-05 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ By the way, we cannot get the character encoding programmatically from the meta data without knowing the character encoding ahead of time!

Re: how to detect the character encoding in a web page ?

2013-06-05 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ I found that PyQt’s QTextStream can detect the character encoding in a web page very accurately, even for this bad page http://www.qnwz.cn/html/yin

Re: how to detect the character encoding in a web page ?

2013-06-05 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ I found that PyQt’s QTextStream can detect the character encoding in a web page very accurately, even for this bad page chardet and Beautiful Soup

Re: how to detect the character encoding in a web page ?

2013-01-14 Thread Albert van der Horst
In article , Roy Smith wrote: >In article , > Alister wrote: > >> Indeed due to the poor quality of most websites it is not possible to be >> 100% accurate for all sites. >> >> personally I would start by checking the doc type & then the meta data as >> these should be quick & correct, I then us

Re: how to detect the character encoding in a web page ?

2013-01-07 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ Up to now, chardet may be the only way to let Python do it automatically. -- http://mail.python.org/mailman/listinfo/python-list

Re: how to detect the character encoding in a web page ?

2012-12-28 Thread python培训
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ First set up chardet:

    import urllib2
    import chardet
    # fetch the page's HTML
    html_1 = urllib2.urlopen(line,timeout=120).read()
    #print html_1
    mychar=chardet.detect(html_1)
    #pri
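
The snippet above is Python 2 (urllib2). A small Python 3 sketch of the same idea, assuming the third-party chardet package may or may not be installed; the wrapper name guess_encoding is illustrative, and fetching the bytes (for instance with urllib.request.urlopen) is left out so the example needs no network access:

```python
try:
    import chardet  # third-party package, e.g. "pip install chardet"
except ImportError:
    chardet = None


def guess_encoding(raw):
    """Return chardet's best guess for the bytes in raw, or None."""
    if chardet is None:
        return None  # detector not installed, nothing to report
    # detect() returns a dict with 'encoding' and 'confidence' keys.
    result = chardet.detect(raw)
    return result["encoding"]
```

The result is a statistical guess, so it is worth reading the 'confidence' value as well before relying on it.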

Re: how to detect the character encoding in a web page ?

2012-12-24 Thread Roy Smith
In article , Alister wrote: > Indeed due to the poor quality of most websites it is not possible to be > 100% accurate for all sites. > > personally I would start by checking the doc type & then the meta data as > these should be quick & correct, I then use chardetect only if these > fail t

Re: how to detect the character encoding in a web page ?

2012-12-24 Thread Alister
On Mon, 24 Dec 2012 13:50:39 +, Steven D'Aprano wrote: > On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote: > >> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller >> wrote: >>> $ wget -q -O - http://python.org/ | chardetect.py stdin: ISO-8859-2 >>> with confidence 0.803579722043 $ >> >> And it

Re: how to detect the character encoding in a web page ?

2012-12-24 Thread Steven D'Aprano
On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote: > On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller > wrote: >> $ wget -q -O - http://python.org/ | chardetect.py stdin: ISO-8859-2 >> with confidence 0.803579722043 $ > > And it sucks, because it uses magic, and not reading the HTML tags. The > RI

Re: how to detect the character encoding in a web page ?

2012-12-24 Thread Kwpolska
On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller wrote: > $ wget -q -O - http://python.org/ | chardetect.py > stdin: ISO-8859-2 with confidence 0.803579722043 > $ And it sucks, because it uses magic, and not reading the HTML tags. The RIGHT thing to do for websites is detect the meta charset definit
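
Reading the meta charset definition from the HTML tags, as suggested above, can be sketched with the standard library's html.parser. The class and function names are illustrative, and decoding the bytes as ASCII first is an assumption that only holds for ASCII-compatible encodings:

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Remember the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset:
            return
        attrs = dict(attrs)
        # HTML5 form: <meta charset="utf-8">
        if attrs.get("charset"):
            self.charset = attrs["charset"].strip().lower()
            return
        # Older form: <meta http-equiv="Content-Type" content="...; charset=...">
        if (attrs.get("http-equiv") or "").lower() == "content-type":
            content = attrs.get("content") or ""
            for part in content.split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value:
                    self.charset = value.strip().strip("\"'").lower()


def meta_charset(raw):
    """Return the charset declared in the page's meta tags, or None."""
    parser = MetaCharsetParser()
    # Non-ASCII bytes become replacement characters; the tags stay readable.
    parser.feed(raw.decode("ascii", errors="replace"))
    return parser.charset
```

A page with no declaration yields None, at which point a statistical detector is one possible fallback.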

Re: how to detect the character encoding in a web page ?

2012-12-24 Thread Kurt Mueller
On 24.12.2012 at 04:03, iMath wrote: > but how to let python do it for you ? > such as these 2 pages > http://python.org/ > http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx > how to detect the character encoding in these 2 pages by python ? If you have the html code, let

Re: how to detect the character encoding in a web page ?

2012-12-23 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ But how can Python do it for you? For example, these 2 pages: http://python.org/ http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).a

Re: how to detect the character encoding in a web page ?

2012-12-23 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ But how can Python do it for you? For example, these 2 pages: http://python.org/ http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).a

Re: how to detect the character encoding in a web page ?

2012-12-23 Thread iMath
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote: > how to detect the character encoding in a web page ? > > such as this page > > > > http://python.org/ But how can Python do it for you? For example, this page: http://python.org/ How can the character encoding of this web page be detected with Python?

Re: how to detect the character encoding in a web page ?

2012-12-23 Thread Hans Mulder
On 24/12/12 01:34:47, iMath wrote: > how to detect the character encoding in a web page ? That depends on the site: different sites indicate their encoding differently. > such as this page: http://python.org/ If you download that page and look at the HTML code, you'll find a line: So it's

Re: how to detect the character encoding in a web page ?

2012-12-23 Thread Chris Angelico
On Mon, Dec 24, 2012 at 11:34 AM, iMath wrote: > how to detect the character encoding in a web page ? > such as this page > > http://python.org/ You read part-way into the page, where you find this: That tells you that the character set is UTF-8. ChrisA