Re: [Tutor] scraping and saving in file

2010-12-30 Thread Patrick Sabin

On 2010-12-29 10:54, Tommy Kaas wrote:

It works fine but besides # I also get spaces between the columns in the
text file. How do I avoid that?


You could use the new print-function and the sep keyword argument, e.g.:

from __future__ import print_function

f = open("filename", "w")
print("1", "2", "3", file=f, sep="")  # writes "123" -- no separating spaces
f.close()

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] scraping and saving in file SOLVED

2010-12-29 Thread Stefan Behnel

Peter Otten, 29.12.2010 13:45:

   File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 430, in encode
 return self.decode().encode(encoding)


Wow, that's evil.

Stefan



Re: [Tutor] scraping and saving in file SOLVED

2010-12-29 Thread Peter Otten
Tommy Kaas wrote:

> With Steven's help about writing and Peter's help about import codecs - and
> when I used \r\n instead of \r to give me new lines everything worked. I
> just thought that \n would be necessary? Thanks.
> Tommy

Newline handling varies across operating systems. If you are on Windows and
open a file in text mode, your program sees plain "\n", but the data stored
on disk is "\r\n". Most other OSes don't mess with newlines.
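A minimal sketch of that difference (Python 3 syntax; the file names are made up):

```python
# Text mode: "\n" is translated to the platform's newline on write --
# "\r\n" on Windows, left as "\n" on most other OSes.
with open("newline_demo.txt", "w") as f:
    f.write("line1\nline2\n")

# Binary mode: bytes are written exactly as given, no translation.
with open("newline_demo.bin", "wb") as f:
    f.write(b"line1\r\nline2\r\n")

with open("newline_demo.bin", "rb") as f:
    data = f.read()
# The binary file contains a literal "\r\n" on every platform.
```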

If you always want "\r\n" you can rely on the csv module to write your data, 
but the drawback is that you have to encode the strings manually:

import csv
import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen(
    'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm').read()
soup = BeautifulSoup(html)

with open('tabeltest.txt', "wb") as f:
    writer = csv.writer(f, delimiter="#")
    rows = soup.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        writer.writerow([unicode(col.string).encode("utf-8")
                         for col in cols])
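The "\r\n" guarantee comes from the csv module itself: the default dialect
terminates every row with "\r\n" regardless of platform. A small
self-contained sketch (Python 3, writing to an in-memory buffer instead of a
real file):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter="#")
# One row from the table in this thread.
writer.writerow(["101", "K\xf8benhavn", "Hovedstaden", "1084"])
line = buf.getvalue()
# csv.writer ends the row with "\r\n" by default.
```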

PS: It took me some time to figure out how to deal with BeautifulSoup's
flavour of unicode:

>>> import BeautifulSoup as bs
>>> s = bs.NavigableString(u"älpha")
>>> s
u'\xe4lpha'
>>> s.encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 430, in encode
return self.decode().encode(encoding)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 
0: ordinal not in range(128)
>>> unicode(s).encode("utf-8") # heureka
'\xc3\xa4lpha'




Re: [Tutor] scraping and saving in file

2010-12-29 Thread Dave Angel

On 2010-12-29, Tommy Kaas wrote:

Steven D'Aprano wrote:

But in your case, the best way is not to use print at all. You are writing to
a file -- write to the file directly, don't mess about with print. Untested:


f = open('tabeltest.txt', 'w')
url = 'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm'
soup = BeautifulSoup(urllib2.urlopen(url).read())
rows = soup.findAll('tr')
for tr in rows:
  cols = tr.findAll('td')
  output = "#".join(cols[i].string for i in (0, 1, 2, 3))
  f.write(output + '\n')  # don't forget the newline after each row
f.close()


Steven, thanks for the advice.
I see the point. But now I have problems with the Danish characters. I get
this:

Traceback (most recent call last):
  File "C:/pythonlib/kursus/kommuner-regioner_ny.py", line 36, in <module>
    f.write(output + '\n')  # don't forget the newline after each row
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position
5: ordinal not in range(128)

I have tried to add # -*- coding: utf-8 -*- to the top of the script, but it
doesn't help?

Tommy

The coding line only affects how characters in the source module are 
interpreted.  For each file you input or output, you need to also decide 
the encoding to use.  As Peter said, you probably need

codecs.open(filename, "w", encoding="utf-8")
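A quick sketch of what that buys you (Python 3, with a made-up file name):
codecs.open handles the encoding step, so unicode text with non-ASCII
characters can be written directly.

```python
import codecs

# codecs.open encodes the unicode text for you on write.
with codecs.open("codecs_demo.txt", "w", encoding="utf-8") as f:
    f.write(u"Br\xf8ndby\n")  # u"Brøndby", from the table in this thread

# On disk the text is stored as UTF-8 bytes.
with open("codecs_demo.txt", "rb") as f:
    raw = f.read()
```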

DaveA



[Tutor] scraping and saving in file SOLVED

2010-12-29 Thread Tommy Kaas
With Steven's help about writing and Peter's help about import codecs - and when
I used \r\n instead of \r to give me new lines, everything worked.
I just thought that \n would be necessary?
Thanks.
Tommy

> -Original message-
> From: tutor-bounces+tommy.kaas=kaasogmulvad...@python.org
> [mailto:tutor-bounces+tommy.kaas=kaasogmulvad...@python.org] On
> behalf of Peter Otten
> Sent: 29 December 2010 11:46
> To: tutor@python.org
> Subject: Re: [Tutor] scraping and saving in file
> 
> Tommy Kaas wrote:
> 
> > I’m trying to learn basic web scraping and starting from scratch. I’m
> > using Activepython 2.6.6
> 
> > I have uploaded a simple table on my web page and try to scrape it and
> > will save the result in a text file. I will separate the columns in
> > the file with #.
> 
> > It works fine but besides # I also get spaces between the columns in
> > the text file. How do I avoid that?
> 
> > This is the script:
> 
> > import urllib2
> > from BeautifulSoup import BeautifulSoup
> > f = open('tabeltest.txt', 'w')
> > soup = BeautifulSoup(urllib2.urlopen(
> >     'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm').read())
> 
> > rows = soup.findAll('tr')
> 
> > for tr in rows:
> >     cols = tr.findAll('td')
> >     print >> f, cols[0].string, '#', cols[1].string, '#', \
> >         cols[2].string, '#', cols[3].string
> >
> > f.close()
> 
> > And the text file looks like this:
> 
> > Kommunenr # Kommune # Region # Regionsnr
> > 101 # København # Hovedstaden # 1084
> > 147 # Frederiksberg # Hovedstaden # 1084
> > 151 # Ballerup # Hovedstaden # 1084
> > 153 # Brøndby # Hovedstaden # 1084
> 
> The print statement automatically inserts spaces, so you can either resort to
> the write method
> 
> for i in range(4):
>     if i:
>         f.write("#")
>     f.write(cols[i].string)
> 
> which is a bit clumsy, or you build the complete line and then print it as a
> whole:
> 
> print >> f, "#".join(col.string for col in cols)
> 
> Note that you have non-ascii characters in your data -- I'm surprised that
> writing to a file works for you. I would expect that
> 
> import codecs
> f = codecs.open("tmp.txt", "w", encoding="utf-8")
> 
> is needed to successfully write your data to a file.
> 
> Peter
> 



Re: [Tutor] scraping and saving in file

2010-12-29 Thread Peter Otten
Tommy Kaas wrote:

> Steven D'Aprano wrote:
>> But in your case, the best way is not to use print at all. You are
>> writing
> to a
>> file -- write to the file directly, don't mess about with print.
>> Untested:
>> 
>> 
>> f = open('tabeltest.txt', 'w')
>> url = 'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm'
>> soup = BeautifulSoup(urllib2.urlopen(url).read())
>> rows = soup.findAll('tr')
>> for tr in rows:
>>  cols = tr.findAll('td')
>>  output = "#".join(cols[i].string for i in (0, 1, 2, 3))
>>  f.write(output + '\n')  # don't forget the newline after each row
>> f.close()
> 
> Steven, thanks for the advice.
> I see the point. But now I have problems with the Danish characters. I get
> this:
> 
> Traceback (most recent call last):
>   File "C:/pythonlib/kursus/kommuner-regioner_ny.py", line 36, in <module>
> f.write(output + '\n')  # don't forget the newline after each row
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in
> position 5: ordinal not in range(128)
> 
> I have tried to add # -*- coding: utf-8 -*- to the top of the script, but
> it doesn't help?

The coding cookie only affects unicode string constants in the source code;
it doesn't change how the unicode data coming from BeautifulSoup is handled.
As I suspected in my other post you have to convert your data to a specific 
encoding (I use UTF-8 below) before you can write it to a file:

import urllib2
import codecs
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen(
    'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm').read()
soup = BeautifulSoup(html)

with codecs.open('tabeltest.txt', "w", encoding="utf-8") as f:
    rows = soup.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        print >> f, "#".join(col.string for col in cols)

The with statement implicitly closes the file, so you can avoid f.close() at 
the end of the script.
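A small sketch of that guarantee (made-up file name): the file is closed as
soon as the with block exits, even when an exception is raised inside it.

```python
f = open("with_demo.txt", "w")
try:
    with f:
        f.write("first row\n")
        raise ValueError("simulated failure mid-write")
except ValueError:
    pass

# No explicit f.close() was called, yet the file is closed.
```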

Peter



Re: [Tutor] scraping and saving in file

2010-12-29 Thread Tommy Kaas
Steven D'Aprano wrote:
> But in your case, the best way is not to use print at all. You are writing
to a
> file -- write to the file directly, don't mess about with print. Untested:
> 
> 
> f = open('tabeltest.txt', 'w')
> url = 'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm'
> soup = BeautifulSoup(urllib2.urlopen(url).read())
> rows = soup.findAll('tr')
> for tr in rows:
>  cols = tr.findAll('td')
>  output = "#".join(cols[i].string for i in (0, 1, 2, 3))
>  f.write(output + '\n')  # don't forget the newline after each row
> f.close()

Steven, thanks for the advice. 
I see the point. But now I have problems with the Danish characters. I get
this:

Traceback (most recent call last):
  File "C:/pythonlib/kursus/kommuner-regioner_ny.py", line 36, in <module>
f.write(output + '\n')  # don't forget the newline after each row
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position
5: ordinal not in range(128)

I have tried to add # -*- coding: utf-8 -*- to the top of the script, but it
doesn't help?

Tommy




Re: [Tutor] scraping and saving in file

2010-12-29 Thread Peter Otten
Tommy Kaas wrote:

> I’m trying to learn basic web scraping and starting from scratch. I’m
> using Activepython 2.6.6

> I have uploaded a simple table on my web page and try to scrape it and
> will save the result in a text file. I will separate the columns in the
> file with
> #.
 
> It works fine but besides # I also get spaces between the columns in the
> text file. How do I avoid that?

> This is the script:

> import urllib2
> from BeautifulSoup import BeautifulSoup
> f = open('tabeltest.txt', 'w')
> soup = BeautifulSoup(urllib2.urlopen(
>     'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm').read())
 
> rows = soup.findAll('tr')

> for tr in rows:
>     cols = tr.findAll('td')
>     print >> f, cols[0].string, '#', cols[1].string, '#', \
>         cols[2].string, '#', cols[3].string
> 
> f.close()

> And the text file looks like this:

> Kommunenr # Kommune # Region # Regionsnr
> 101 # København # Hovedstaden # 1084
> 147 # Frederiksberg # Hovedstaden # 1084
> 151 # Ballerup # Hovedstaden # 1084
> 153 # Brøndby # Hovedstaden # 1084

The print statement automatically inserts spaces, so you can either resort 
to the write method

for i in range(4):
    if i:
        f.write("#")
    f.write(cols[i].string)

which is a bit clumsy, or you build the complete line and then print it as a 
whole:

print >> f, "#".join(col.string for col in cols)

Note that you have non-ascii characters in your data -- I'm surprised that 
writing to a file works for you. I would expect that

import codecs
f = codecs.open("tmp.txt", "w", encoding="utf-8")

is needed to successfully write your data to a file.

Peter



Re: [Tutor] scraping and saving in file

2010-12-29 Thread Knacktus

On 2010-12-29 10:54, Tommy Kaas wrote:

Hi,

I’m trying to learn basic web scraping and starting from scratch. I’m
using Activepython 2.6.6

I have uploaded a simple table on my web page and try to scrape it and
will save the result in a text file. I will separate the columns in the
file with #.

It works fine but besides # I also get spaces between the columns in the
text file. How do I avoid that?

This is the script:

import urllib2

from BeautifulSoup import BeautifulSoup

f = open('tabeltest.txt', 'w')

soup = BeautifulSoup(urllib2.urlopen(
    'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm').read())

rows = soup.findAll('tr')

for tr in rows:
    cols = tr.findAll('td')
    print >> f, cols[0].string, '#', cols[1].string, '#', \
        cols[2].string, '#', cols[3].string


You can strip the whitespace from the strings. I assume the
"string" attribute returns a string (I don't know the API of Beautiful
Soup). E.g.:

cols[0].string.strip()

Also, you can use join() to create the complete string:

resulting_string = "#".join([col.string.strip() for col in cols])

The long version without the list comprehension (just for illustration,
better use the list comprehension):

resulting_string = "#".join([cols[0].string.strip(),
                             cols[1].string.strip(),
                             cols[2].string.strip(),
                             cols[3].string.strip()])
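The stripping and joining can be sketched with hypothetical cell strings
standing in for the col.string values (the stray whitespace is made up):

```python
# Hypothetical cell values with stray whitespace around them.
cells = ["101 ", " K\xf8benhavn", " Hovedstaden ", "1084"]

line = "#".join(cell.strip() for cell in cells)
# line is "101#K\xf8benhavn#Hovedstaden#1084" -- no stray spaces.
```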


HTH,

Jan






f.close()

And the text file looks like this:

Kommunenr # Kommune # Region # Regionsnr

101 # København # Hovedstaden # 1084

147 # Frederiksberg # Hovedstaden # 1084

151 # Ballerup # Hovedstaden # 1084

153 # Brøndby # Hovedstaden # 1084

155 # Dragør # Hovedstaden # 1084

Thanks in advance

Tommy Kaas

Kaas & Mulvad

Lykkesholms Alle 2A, 3.

1902 Frederiksberg C

Mobil: 27268818

Mail: tommy.k...@kaasogmulvad.dk 

Web: www.kaasogmulvad.dk 







Re: [Tutor] scraping and saving in file

2010-12-29 Thread Steven D'Aprano

Tommy Kaas wrote:


I have uploaded a simple table on my web page and try to scrape it and will
save the result in a text file. I will separate the columns in the file with
#.

It works fine but besides # I also get spaces between the columns in the
text file. How do I avoid that?


The print command puts spaces between each output object:

>>> print 1, 2, 3  # Three objects being printed.
1 2 3

To prevent this, use a single output object. There are many ways to do 
this, here are three:


>>> print "%d%d%d" % (1, 2, 3)
123
>>> print str(1) + str(2) + str(3)
123
>>> print ''.join('%s' % n for n in (1, 2, 3))
123


But in your case, the best way is not to use print at all. You are 
writing to a file -- write to the file directly, don't mess about with 
print. Untested:



f = open('tabeltest.txt', 'w')
url = 'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm'
soup = BeautifulSoup(urllib2.urlopen(url).read())
rows = soup.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    output = "#".join(cols[i].string for i in (0, 1, 2, 3))
    f.write(output + '\n')  # don't forget the newline after each row
f.close()



--
Steven



[Tutor] scraping and saving in file

2010-12-29 Thread Tommy Kaas
Hi,

I’m trying to learn basic web scraping and starting from scratch. I’m using
Activepython 2.6.6

 

I have uploaded a simple table on my web page and try to scrape it and will
save the result in a text file. I will separate the columns in the file with
#.

It works fine but besides # I also get spaces between the columns in the
text file. How do I avoid that?

 

This is the script:

 

import urllib2 

from BeautifulSoup import BeautifulSoup 

 

f = open('tabeltest.txt', 'w')

 

soup = BeautifulSoup(urllib2.urlopen(
    'http://www.kaasogmulvad.dk/unv/python/tabeltest.htm').read())

 

rows = soup.findAll('tr')

 

for tr in rows:
    cols = tr.findAll('td')
    print >> f, cols[0].string, '#', cols[1].string, '#', \
        cols[2].string, '#', cols[3].string

f.close()

 

And the text file looks like this:

 

Kommunenr # Kommune # Region # Regionsnr

101 # København # Hovedstaden # 1084

147 # Frederiksberg # Hovedstaden # 1084

151 # Ballerup # Hovedstaden # 1084

153 # Brøndby # Hovedstaden # 1084

155 # Dragør # Hovedstaden # 1084

 

Thanks in advance

 

Tommy Kaas

 

Kaas & Mulvad

Lykkesholms Alle 2A, 3.

1902 Frederiksberg C

 

Mobil: 27268818

Mail:   tommy.k...@kaasogmulvad.dk

Web: www.kaasogmulvad.dk

 
