Re: [Tutor] A CSV field is a list of integers - how to read it as such?

2013-03-06 Thread DoanVietTrungAtGmail
 Once a csv file has been read by a csv reader (such as DictReader), it's
 no longer a csv file.


That was an Aha! moment for me. The file is on disk, each row of it is in
memory as a list or dict, and it's the list or dict that matters. It's so
obvious now. Thanks Dave.



a namedtuple is probably exactly what you want.


I read this as meaning that while tuples themselves are immutable, to
effectively modify one I simply replace it with a new tuple containing
the new values. Another 'Aha!'

Trung


Re: [Tutor] A CSV field is a list of integers - how to read it as such?

2013-03-06 Thread Dave Angel

On 03/06/2013 09:05 PM, DoanVietTrungAtGmail wrote:

Once a csv file has been read by a csv reader (such as DictReader), it's
no longer a csv file.



That was an Aha! moment for me. The file is on disk, each row of it is in
memory as a list or dict, and it's the list or dict that matters. It's so
obvious now. Thanks Dave.





a namedtuple is probably exactly what you want.


I read this as meaning that while tuples themselves are immutable, to
effectively modify one I simply replace it with a new tuple containing
the new values. Another 'Aha!'



A collections.namedtuple is not the same as a tuple.  The items in it 
can be addressed either by index, or by name.  But each instance of the 
tuple does *not* waste space duplicating those names.  The names are 
stored once per type of namedtuple.


http://docs.python.org/2/library/collections.html#collections.namedtuple

Look at the example following the comment:
"Named tuples are especially useful for assigning field names to 
result tuples returned by the csv or sqlite3 modules:"


And an instance created through collections.namedtuple has a useful method:

somenamedtuple._replace(**kwargs)
Return a new instance of the named tuple replacing specified fields with 
new values:
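
A minimal sketch of both points (the Point type and its field names here 
are invented for illustration):

import collections

# Field names are stored on the type, not on each instance.
Point = collections.namedtuple('Point', ['x', 'y'])

p = Point(x=3, y=4)
print p[0]    # by index -> 3
print p.y     # by name  -> 4

# _replace returns a *new* namedtuple; p itself is unchanged.
print p._replace(y=9)    # Point(x=3, y=9)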


--
DaveA


Re: [Tutor] A CSV field is a list of integers - how to read it as such?

2013-03-04 Thread Dave Angel

On 03/04/2013 01:48 AM, DoanVietTrungAtGmail wrote:

Don, Dave - Thanks for your help!

Don: Thanks! I've just browsed the AST documentation, much of it goes over
my head, but the ast.literal_eval helper function works beautifully for me.

Dave: Again, thanks! Also, you asked "More space efficient than what?" I
meant .csv versus dict, list, and objects. Specifically, if I read a
10-million row .csv file into RAM, how is its RAM footprint compared to a
list or dict containing 10M equivalent items, or to 10M equivalent class
instances living in RAM.


Once a csv file has been read by a csv reader (such as DictReader), it's 
no longer a csv file.  The data in memory never exists as a copy of the 
file on disk.  The way you wrote the code, each row exists as a dict of 
strings, but more commonly, each row would exist as a list of strings.


The csv logic does not keep more than one row at a time, so if you want 
a big list to exist at one time, you'll be making one yourself. 
(Perhaps by using append inside the loop instead of the print you're 
doing now).
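
A minimal sketch of that idea, assuming the testCSV.csv file from earlier 
in the thread:

import csv

rows = []
with open('testCSV.csv', 'r') as f:
    for row in csv.DictReader(f):
        rows.append(row)    # the reader yields one row at a time; the list is ours

print len(rows), 'rows held in memory'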


So the question is not how much RAM the csv data takes up, but how 
much RAM is used by whatever form you use.  In that, you shouldn't worry 
about the overhead of the list, but the overhead of however you store 
each individual row.  When a list overallocates, the unused slots each 
take up 4 or 8 bytes (one pointer each), as opposed to probably thousands 
of bytes for each row that is used.


 I've just tested and learned that a .csv file has

very little overhead, in the order of bytes not KB. Presumably the same
applies when the file is read into RAM.

As to the RAM overheads of dict, list, and class instances, I've just found
some stackoverflow discussions.
One (http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory)
says that for large lists in CPython, the overallocation is 12.5 percent.



So the first question is whether you really need the data to all be 
instantly addressable in RAM at one time.  If you can do all your 
processing a row at a time, then the problem goes away.


Assuming you do need random access to the rows, then the next thing to 
consider is whether a dict is the best way to describe the columns. 
Since every dict has the same keys, and since they're presumably known 
to your source code, then a custom class for the row is probably better, 
and a namedtuple is probably exactly what you want. There is then no 
overhead for the names of the columns, and the elements of the tuple are 
either ints or lists of ints.
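
A sketch of such a row type, using the field names from the original post:

from collections import namedtuple

Record = namedtuple('Record', ['id', 'type', 'level', 'ListInRecord'])

rec = Record(id=1, type=1, level=1, ListInRecord=[2, 9])
print rec.id, rec.ListInRecord    # the field names add no per-instance cost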


If that's not compact enough, then the next thing to consider is how you 
store those ints.  If there's lots of them, and especially if you can 
constrain how big the largest is, then you could use the array module. 
It assumes all the numeric items are limited to a particular size, and 
you can specify that size.  For example, if all the ints are nonnegative 
and less than 256, you could do:


import array
myarray = array.array('B', mylist)  # 'B' = unsigned byte, holds 0..255

An array is somewhat slower than a list, but it holds lots more integers 
in a given space.
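
One rough way to see the space difference in CPython (note that 
sys.getsizeof counts the list's pointer array but not the int objects 
those pointers reference, so the list really costs even more than shown):

import array
import sys

nums = range(10000)              # a plain list of ints (Python 2)
arr = array.array('i', nums)     # 'i' = C int; 'B' suffices for 0..255

print sys.getsizeof(nums)    # roughly 4 or 8 bytes per pointer
print sys.getsizeof(arr)     # roughly 4 bytes per element, stored inline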


Since RAM size is your concern, the fact that you happen to serialize it 
into a csv is irrelevant.  That's a good choice if you want to be able 
to examine the data in a text editor, or import it into a spreadsheet. 
If you have other requirements, we can figure them out in a separate 
question.


--
DaveA


Re: [Tutor] A CSV field is a list of integers - how to read it as such?

2013-03-04 Thread Steven D'Aprano

On 05/03/13 00:24, Dave Angel wrote:

import array
myarray = array.array('B', mylist)

An array is somewhat slower than a list,



I think that it's true that using a list *can* be faster, but that's only 
because we're comparing apples with oranges. Arrays do more work than lists. 
For example, using Python 2.7 constructing an array is much slower than 
constructing a list:

[steve@ando ~]$ python -m timeit -s "from array import array" "array('b', xrange(100))"
100000 loops, best of 3: 19.1 usec per loop

[steve@ando ~]$ python -m timeit "list(xrange(100))"
100000 loops, best of 3: 3.26 usec per loop


but that's only because the list code doesn't perform the same range checking 
as the array does. If we add range checking ourselves, we see very different 
results:

[steve@ando ~]$ python -m timeit "list(x for x in xrange(100) if 0 <= x < 256)"
10000 loops, best of 3: 27.4 usec per loop


So I would suggest that constructing an array is significantly faster than 
constructing a restricted list.



Here's another example: summing a list versus an array. In this specific 
example, there is a small but consistent advantage to lists, but probably not 
one that's worth caring about:


[steve@ando ~]$ python -m timeit -s "arr = range(100)" "sum(arr)"
100000 loops, best of 3: 2.34 usec per loop

[steve@ando ~]$ python -m timeit -s "from array import array" -s "arr = array('b', range(100))" "sum(arr)"
100000 loops, best of 3: 2.78 usec per loop



--
Steven


Re: [Tutor] A CSV field is a list of integers - how to read it as such?

2013-03-03 Thread Don Jennings

On Mar 3, 2013, at 9:24 PM, DoanVietTrungAtGmail wrote:

 Dear tutors
 
 I am checking out csv as a possible data structure for my records. In each 
 record, some fields are an integer and some are a list of integers of 
 variable length. I use csv.DictWriter to write data. When reading out using 
 csv.DictReader, each row is read as a string, per the csv module's standard 
 behaviour. To get these columns as lists of integers, I can think of only a 
 multi-step process: first, remove the brackets enclosing the string; second, 
 split the string into a list containing substrings; third, convert  each 
 substring into an integer. This process seems inelegant. Is there a better 
 way to get integers and lists of integers from a csv file?

A quick search for "python list object from string representation of list" 
returned an idea from stackoverflow which I have adapted for integers:

>>> import ast
>>> num_string = "[1, 2, 3]"
>>> ast.literal_eval(num_string)
[1, 2, 3]
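
Applied to the csv question, a small sketch (this assumes the testCSV.csv 
file and ListInRecord column from the original post):

import ast
import csv

with open('testCSV.csv', 'r') as f:
    for row in csv.DictReader(f):
        nums = ast.literal_eval(row['ListInRecord'])    # "[2, 9]" -> [2, 9]
        print row['id'], nums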


Take care,
Don



Re: [Tutor] A CSV field is a list of integers - how to read it as such?

2013-03-03 Thread Dave Angel

On 03/03/2013 09:24 PM, DoanVietTrungAtGmail wrote:

Dear tutors

I am checking out csv as a possible data structure for my records. In each
record, some fields are an integer and some are a list of integers of
variable length. I use csv.DictWriter to write data. When reading out using
csv.DictReader, each row is read as a string, per the csv module's standard
behaviour. To get these columns as lists of integers, I can think of only a
multi-step process: first, remove the brackets enclosing the string;
second, split the string into a list containing substrings; third, convert
  each substring into an integer. This process seems inelegant. Is there a
better way to get integers and lists of integers from a csv file?

Or, is a csv file simply not the best data structure given the above
requirement?


Your terminology is very confusing.  A csv is not a data structure, it's 
a method of serializing lists of strings.  Or in this case dicts of 
strings.  If a particular dict value isn't a string, it'll get converted 
to one implicitly.  csv does not handle variable length records, so this 
is close to the best you're going to do.


 Apart from csv, I considered using a dict or list, or using an

object to represent each row.


Objects don't exist in a file, so they don't persist between multiple 
runs of the program.  Likewise dict and list.  So no idea what you 
really meant.


 I am being attracted to csv because csv means

serialisation is unnecessary, I just need to close and open the file to
stop and continue later (it's a simulation experiment).


Closing and opening don't do anything to persist data, but we can guess 
you must have meant to imply reading and writing as well.  And you've 
nicely finessed the serialization in the write step, but as you 
discovered, you'll have to handle the deserialization to get back to 
ints and list.


 Also, I am guessing

but haven't checked, csv is more space efficient.


More space efficient than what?

 Each row contains a few

integers plus a few lists containing hundreds of integers, and there will
be up to hundreds of millions of rows.

CODE: My Python 2.7 code is below. It doesn't have the third step
(substring -> int).

import csv

record1 = {'id':1, 'type':1, 'level':1, 'ListInRecord':[2, 9]}
record2 = {'id':2, 'type':1, 'level':1, 'ListInRecord':[1, 9]}
record3 = {'id':3, 'type':2, 'level':1, 'ListInRecord':[2]}
record9 = {'id':9, 'type':3, 'level':0, 'ListInRecord':[]}
rows = [record1, record2, record3, record9]
header = ['id', 'type', 'level', 'ListInRecord']

with open('testCSV.csv', 'wb') as f:
 fCSV = csv.DictWriter(f, header)
 fCSV.writeheader()
 fCSV.writerows(rows)

with open('testCSV.csv', 'r') as f:
 fCSV = csv.DictReader(f)
 for row in fCSV:


 I'd add the deserialization here.  For each item in row, if the 
value begins with '[' and ends with ']' then make it into a list, and if 
it begins with a digit or a minus sign, make it into an int.  Then for 
the lists, convert each element to an int.  You can use Don Jennings' 
suggestion to save a lot of effort here.


This should reconstruct the original recordN dicts precisely.  But it'll 
take some testing to be sure.
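
A sketch of that step, assuming every field is either an int or a 
bracketed list of ints (as in the records above):

import csv

def deserialize(row):
    # Convert each string field back into an int or a list of ints.
    result = {}
    for key, value in row.items():
        if value.startswith('[') and value.endswith(']'):
            inner = value[1:-1].strip()
            result[key] = [int(s) for s in inner.split(',')] if inner else []
        else:
            result[key] = int(value)
    return result

with open('testCSV.csv', 'r') as f:
    for row in csv.DictReader(f):
        print deserialize(row)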



 print 'ID=', row['id'],'ListInRecord=',
row['ListInRecord'][1:-1].split(', ') # I want this to be a list of
integers, NOT list of strings

OUTPUT:

ID= 1 ListInRecord= ['2', '9']
ID= 2 ListInRecord= ['1', '9']
ID= 3 ListInRecord= ['2']
ID= 9 ListInRecord= ['']




--
DaveA


Re: [Tutor] A CSV field is a list of integers - how to read it as such?

2013-03-03 Thread DoanVietTrungAtGmail
Don, Dave - Thanks for your help!

Don: Thanks! I've just browsed the AST documentation, much of it goes over
my head, but the ast.literal_eval helper function works beautifully for me.

Dave: Again, thanks! Also, you asked "More space efficient than what?" I
meant .csv versus dict, list, and objects. Specifically, if I read a
10-million row .csv file into RAM, how is its RAM footprint compared to a
list or dict containing 10M equivalent items, or to 10M equivalent class
instances living in RAM. I've just tested and learned that a .csv file has
very little overhead, in the order of bytes not KB. Presumably the same
applies when the file is read into RAM.

As to the RAM overheads of dict, list, and class instances, I've just found
some stackoverflow discussions.
One (http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory)
says that for large lists in CPython, the
overallocation is 12.5 percent.

Trung Doan


On Mon, Mar 4, 2013 at 2:12 PM, Dave Angel da...@davea.name wrote:

 On 03/03/2013 09:24 PM, DoanVietTrungAtGmail wrote:

 Dear tutors

 I am checking out csv as a possible data structure for my records. In each
 record, some fields are an integer and some are a list of integers of
 variable length. I use csv.DictWriter to write data. When reading out
 using
 csv.DictReader, each row is read as a string, per the csv module's
 standard
 behaviour. To get these columns as lists of integers, I can think of only
 a
 multi-step process: first, remove the brackets enclosing the string;
 second, split the string into a list containing substrings; third, convert
   each substring into an integer. This process seems inelegant. Is there a
 better way to get integers and lists of integers from a csv file?

 Or, is a csv file simply not the best data structure given the above
 requirement?


 Your terminology is very confusing.  A csv is not a data structure, it's a
 method of serializing lists of strings.  Or in this case dicts of strings.
  If a particular dict value isn't a string, it'll get converted to one
 implicitly.  csv does not handle variable length records, so this is close
 to the best you're going to do.


  Apart from csv, I considered using a dict or list, or using an

 object to represent each row.


 Objects don't exist in a file, so they don't persist between multiple runs
 of the program.  Likewise dict and list.  So no idea what you really meant.


  I am being attracted to csv because csv means

 serialisation is unnecessary, I just need to close and open the file to
 stop and continue later (it's a simulation experiment).


 Closing and opening don't do anything to persist data, but we can guess
 you must have meant to imply reading and writing as well.  And you've
 nicely finessed the serialization in the write step, but as you discovered,
 you'll have to handle the deserialization to get back to ints and list.


  Also, I am guessing

 but haven't checked, csv is more space efficient.


 More space efficient than what?


  Each row contains a few

 integers plus a few lists containing hundreds of integers, and there will
 be up to hundreds of millions of rows.

 CODE: My Python 2.7 code is below. It doesn't have the third step
 (substring -> int).

 import csv

 record1 = {'id':1, 'type':1, 'level':1, 'ListInRecord':[2, 9]}
 record2 = {'id':2, 'type':1, 'level':1, 'ListInRecord':[1, 9]}
 record3 = {'id':3, 'type':2, 'level':1, 'ListInRecord':[2]}
 record9 = {'id':9, 'type':3, 'level':0, 'ListInRecord':[]}
 rows = [record1, record2, record3, record9]
 header = ['id', 'type', 'level', 'ListInRecord']

 with open('testCSV.csv', 'wb') as f:
  fCSV = csv.DictWriter(f, header)
  fCSV.writeheader()
  fCSV.writerows(rows)

 with open('testCSV.csv', 'r') as f:
  fCSV = csv.DictReader(f)
  for row in fCSV:


  I'd add the deserialization here.  For each item in row, if the value
 begins with '[' and ends with ']' then make it into a list, and if it
 begins with a digit or a minus sign, make it into an int.  Then for the
 lists, convert each element to an int.  You can use Don Jennings'
 suggestion to save a lot of effort here.

 This should reconstruct the original recordN dicts precisely.  But it'll
 take some testing to be sure.


  print 'ID=', row['id'],'ListInRecord=',
 row['ListInRecord'][1:-1].split(', ') # I want this to be a list of
 integers, NOT list of strings

 OUTPUT:

 ID= 1 ListInRecord= ['2', '9']
 ID= 2 ListInRecord= ['1', '9']
 ID= 3 ListInRecord= ['2']
 ID= 9 ListInRecord= ['']



 --
 DaveA



Re: [Tutor] A CSV field is a list of integers - how to read it as such?

2013-03-03 Thread Steven D'Aprano

On 04/03/13 17:48, DoanVietTrungAtGmail wrote:

Don, Dave - Thanks for your help!

Don: Thanks! I've just browsed the AST documentation, much of it goes over
my head, but the ast.literal_eval helper function works beautifully for me.

Dave: Again, thanks! Also, you asked "More space efficient than what?" I
meant .csv versus dict, list, and objects. Specifically, if I read a
10-million row .csv file into RAM, how is its RAM footprint compared to a
list or dict containing 10M equivalent items, or to 10M equivalent class
instances living in RAM. I've just tested and learned that a .csv file has
very little overhead, in the order of bytes not KB. Presumably the same
applies when the file is read into RAM.


How many items per row? How many characters per item?

CSV files are just text files. So they'll take as much memory as they have
characters, multiplied by the number of bytes per character, e.g.:

ASCII or Latin-1: 1 byte per character

UTF-16: 2 bytes per character

UTF-32: 4 bytes per character

UTF-8: variable, depends on the characters but typically close to 1 byte for
Western-European text.


Suppose you have CSV stored in UTF-16, 10-million rows, with one hundred columns
per row, and each column averages 30 characters, giving approximately 6200
bytes per row, or 62 gigabytes in total. That's a pretty big file. Does your
computer have 62 GB of memory? If not, you're going to have a bit of trouble
reading in the entire file all at once...
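
The arithmetic, as a quick sketch using the numbers above:

rows = 10 * 10**6     # ten million rows
cols = 100            # columns per row
chars = 30            # average characters per column
bytes_per_char = 2    # UTF-16

per_row = cols * chars * bytes_per_char    # 6000 bytes, ~6200 with separators
print per_row, rows * per_row              # about 60 GB before separators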

But if you process only one row at a time, you only have to handle about 6.2 KB
per row at a time. When that gets converted into a list of strings, that will
take about 24 KB.




As to the RAM overheads of dict, list, and class instances, I've just found
some stackoverflow discussions.
One (http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory)
says that for large lists in CPython, the overallocation is 12.5 percent.



Yes. Do you have a question about it?




--
Steven