Re: [Tutor] A CSV field is a list of integers - how to read it as such?
> Once a csv file has been read by a csv reader (such as DictReader), it's no longer a csv file.

That was an Aha! moment for me. The file is on disk, each row of it is in memory as a list or dict, and it's the list or dict that matters. It's so obvious now. Thanks Dave.

> a namedtuple is probably exactly what you want.

I read this as meaning that while tuples themselves are immutable, to effectively modify one I simply delete it and replace it with a new tuple with new values. Another 'Aha!'

Trung
___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] A CSV field is a list of integers - how to read it as such?
On 03/06/2013 09:05 PM, DoanVietTrungAtGmail wrote:
> Once a csv file has been read by a csv reader (such as DictReader), it's
> no longer a csv file. That was an Aha! moment for me. The file is on disk,
> each row of it is in memory as a list or dict, and it's the list or dict
> that matters. It's so obvious now. Thanks Dave.
>
> > a namedtuple is probably exactly what you want.
>
> I read this as meaning that while tuples themselves are immutable, to
> effectively modify one I simply delete it and replace it with a new tuple
> with new values. Another 'Aha!'

A collections.namedtuple is not the same as a tuple. The items in it can be addressed either by index or by name. But each instance of the tuple does *not* waste space duplicating those names; the names are stored once per type of namedtuple.

http://docs.python.org/2/library/collections.html#collections.namedtuple

Look at the example following the comment:

    Named tuples are especially useful for assigning field names to result
    tuples returned by the csv or sqlite3 modules

And an instance created through collections.namedtuple has a useful method:

    somenamedtuple._replace(**kwargs)
        Return a new instance of the named tuple replacing specified fields
        with new values.

-- 
DaveA
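Dave's two points -- field access by name or index, and _replace returning a new instance rather than mutating the old one -- can be seen in a short sketch (Python 3 syntax; the field names here are invented for illustration, not taken from the thread):

```python
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only.
Record = namedtuple('Record', ['id', 'type', 'level'])

r = Record(id=1, type=1, level=1)
assert r.level == r[2]   # addressable by name or by index

# _replace builds a NEW namedtuple; the original is untouched.
r2 = r._replace(level=2)
assert r.level == 1 and r2.level == 2
```

This is the "delete and replace" idea Trung describes: the name can simply be rebound to the new instance.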
Re: [Tutor] A CSV field is a list of integers - how to read it as such?
On 03/04/2013 01:48 AM, DoanVietTrungAtGmail wrote:
> Don, Dave - Thanks for your help!
>
> Don: Thanks! I've just browsed the AST documentation, much of it goes over
> my head, but the ast.literal_eval helper function works beautifully for me.
>
> Dave: Again, thanks! Also, you asked "More space efficient than what?" I
> meant .csv versus dict, list, and objects. Specifically, if I read a
> 10-million row .csv file into RAM, how is its RAM footprint compared to a
> list or dict containing 10M equivalent items, or to 10M equivalent class
> instances living in RAM.

Once a csv file has been read by a csv reader (such as DictReader), it's no longer a csv file. The data in memory never exists as a copy of the file on disk. The way you wrote the code, each row exists as a dict of strings, but more commonly, each row would exist as a list of strings. The csv logic does not keep more than one row at a time, so if you want a big list to exist at one time, you'll be making one yourself (perhaps by using append inside the loop, instead of the print you're doing now).

So the question is not how much RAM the csv data takes up, but how much RAM is used by whatever form you use. In that, you shouldn't worry about the overhead of the list, but the overhead of however you store each individual row. When a list overallocates, the unused rows each take up 4 or 8 bytes, as opposed to probably thousands of bytes for each row that is used.

> I've just tested and learned that a .csv file has very little overhead, in
> the order of bytes not KB. Presumably the same applies when the file is
> read into RAM. As to the RAM overheads of dict, list, and class instances,
> I've just found some stackoverflow discussions. One
> (http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory)
> says that for large lists in CPython, the overallocation is 12.5 percent.

So the first question is whether you really need the data to all be instantly addressable in RAM at one time. If you can do all your processing a row at a time, then the problem goes away.

Assuming you do need random access to the rows, the next thing to consider is whether a dict is the best way to describe the columns. Since every dict has the same keys, and since they're presumably known to your source code, a custom class for the row is probably better, and a namedtuple is probably exactly what you want. There is then no overhead for the names of the columns, and the elements of the tuple are either ints or lists of ints.

If that's not compact enough, the next thing to consider is how you store those ints. If there are lots of them, and especially if you can constrain how big the largest is, then you could use the array module. It assumes all the numeric items are limited to a particular size, and you can specify that size. For example, if all the ints are nonnegative and less than 256, you could do:

    import array
    myarray = array.array('B', mylist)   # 'B' = unsigned byte, 0..255

An array is somewhat slower than a list, but it holds lots more integers in a given space.

Since RAM size is your concern, the fact that you happen to serialize it into a csv is irrelevant. That's a good choice if you want to be able to examine the data in a text editor, or import it into a spreadsheet. If you have other requirements, we can figure them out in a separate question.

-- 
DaveA
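The space saving Dave describes can be checked directly (a sketch with made-up data, using CPython's sys.getsizeof; exact byte counts vary by platform and Python version):

```python
import array
import sys

nums = list(range(256))    # ints 0..255 fit the unsigned 'B' typecode
arr = array.array('B', nums)

assert arr.itemsize == 1   # the array stores one byte per item

# The list stores a machine-word pointer per item, and each element is a
# separate int object on top of that, so the array is far more compact.
assert sys.getsizeof(arr) < sys.getsizeof(nums)
assert list(arr) == nums   # same values either way
```

The container comparison above ignores the per-element int objects the list also needs, so it actually understates the list's cost.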
Re: [Tutor] A CSV field is a list of integers - how to read it as such?
On 05/03/13 00:24, Dave Angel wrote:
>     import array
>     myarray = array.array('B', mylist)
>
> An array is somewhat slower than a list,

I think that it's true that using a list *can* be faster, but that's only because we're comparing apples with oranges. Arrays do more work than lists.

For example, using Python 2.7, constructing an array is much slower than constructing a list:

    [steve@ando ~]$ python -m timeit -s "from array import array" "array('b', xrange(100))"
    10 loops, best of 3: 19.1 usec per loop
    [steve@ando ~]$ python -m timeit "list(xrange(100))"
    10 loops, best of 3: 3.26 usec per loop

but that's only because the list code doesn't perform the same range checking as the array does. If we add range checking ourselves, we see very different results:

    [steve@ando ~]$ python -m timeit "list(x for x in xrange(100) if 0 <= x < 256)"
    1 loops, best of 3: 27.4 usec per loop

So I would suggest that constructing an array is significantly faster than constructing a restricted list.

Here's another example: summing a list versus an array. In this specific example, there is a small but consistent advantage to lists, but probably not one that's worth caring about:

    [steve@ando ~]$ python -m timeit -s "arr = range(100)" "sum(arr)"
    10 loops, best of 3: 2.34 usec per loop
    [steve@ando ~]$ python -m timeit -s "from array import array" -s "arr = array('b', range(100))" "sum(arr)"
    10 loops, best of 3: 2.78 usec per loop

-- 
Steven
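The same comparison can be run from inside Python with the timeit module (a sketch in Python 3, so range replaces xrange; absolute timings will differ by machine and are not the thread's numbers):

```python
import timeit

# Time constructing an array vs. a plain list of the same 100 ints.
t_array = timeit.timeit("array('b', range(100))",
                        setup="from array import array", number=10000)
t_list = timeit.timeit("list(range(100))", number=10000)

# Totals for 10,000 runs of each statement, in seconds.
print(t_array, t_list)
```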
Re: [Tutor] A CSV field is a list of integers - how to read it as such?
On Mar 3, 2013, at 9:24 PM, DoanVietTrungAtGmail wrote:
> Dear tutors
>
> I am checking out csv as a possible data structure for my records. In each
> record, some fields are an integer and some are a list of integers of
> variable length. I use csv.DictWriter to write data. When reading out
> using csv.DictReader, each row is read as a string, per the csv module's
> standard behaviour. To get these columns as lists of integers, I can think
> of only a multi-step process: first, remove the brackets enclosing the
> string; second, split the string into a list containing substrings; third,
> convert each substring into an integer. This process seems inelegant. Is
> there a better way to get integers and lists of integers from a csv file?

A quick search for "python list object from string representation of list" returned an idea from stackoverflow, which I have adapted for integers:

    >>> import ast
    >>> num_string = "[1, 2, 3]"
    >>> ast.literal_eval(num_string)
    [1, 2, 3]

Take care,
Don
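Applied to a row from csv.DictReader, the same idea converts every field in one pass (a sketch; the field names are borrowed from the code later in the thread, and it assumes every field is a valid Python literal):

```python
import ast

# A row as csv.DictReader would return it: every value is a string.
row = {'id': '1', 'type': '1', 'ListInRecord': '[2, 9]'}

# literal_eval safely parses ints and lists of ints alike,
# without the risks of plain eval().
parsed = {key: ast.literal_eval(value) for key, value in row.items()}

assert parsed == {'id': 1, 'type': 1, 'ListInRecord': [2, 9]}
```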
Re: [Tutor] A CSV field is a list of integers - how to read it as such?
On 03/03/2013 09:24 PM, DoanVietTrungAtGmail wrote:
> Dear tutors
>
> I am checking out csv as a possible data structure for my records. In each
> record, some fields are an integer and some are a list of integers of
> variable length. I use csv.DictWriter to write data. When reading out
> using csv.DictReader, each row is read as a string, per the csv module's
> standard behaviour. To get these columns as lists of integers, I can think
> of only a multi-step process: first, remove the brackets enclosing the
> string; second, split the string into a list containing substrings; third,
> convert each substring into an integer. This process seems inelegant. Is
> there a better way to get integers and lists of integers from a csv file?
> Or, is a csv file simply not the best data structure given the above
> requirement?

Your terminology is very confusing. A csv is not a data structure; it's a method of serializing lists of strings -- or, in this case, dicts of strings. If a particular dict value isn't a string, it'll get converted to one implicitly. csv does not handle variable length records, so this is close to the best you're going to do.

> Apart from csv, I considered using a dict or list, or using an object to
> represent each row.

Objects don't exist in a file, so they don't persist between multiple runs of the program. Likewise dict and list. So no idea what you really meant.

> I am being attracted to csv because csv means serialisation is
> unnecessary, I just need to close and open the file to stop and continue
> later (it's a simulation experiment).

Closing and opening don't do anything to persist data, but we can guess you must have meant to imply reading and writing as well. And you've nicely finessed the serialization in the write step, but as you discovered, you'll have to handle the deserialization to get back to ints and lists.

> Also, I am guessing but haven't checked, csv is more space efficient.

More space efficient than what?

> Each row contains a few integers plus a few lists containing hundreds of
> integers, and there will be up to hundreds of millions of rows.
>
> CODE: My Python 2.7 code is below. It doesn't have the third step
> (substring -> int).
>
> import csv
>
> record1 = {'id':1, 'type':1, 'level':1, 'ListInRecord':[2, 9]}
> record2 = {'id':2, 'type':1, 'level':1, 'ListInRecord':[1, 9]}
> record3 = {'id':3, 'type':2, 'level':1, 'ListInRecord':[2]}
> record9 = {'id':9, 'type':3, 'level':0, 'ListInRecord':[]}
> rows = [record1, record2, record3, record9]
> header = ['id', 'type', 'level', 'ListInRecord']
>
> with open('testCSV.csv', 'wb') as f:
>     fCSV = csv.DictWriter(f, header)
>     fCSV.writeheader()
>     fCSV.writerows(rows)
>
> with open('testCSV.csv', 'r') as f:
>     fCSV = csv.DictReader(f)
>     for row in fCSV:

I'd add the deserialization here. For each item in row, if the value begins and ends with [ ], then make it into a list; and if a digit or minus-sign, make it into an int. Then for the lists, convert each element to an int. You can use Don Jennings' suggestion to save a lot of effort here. This should reconstruct the original recordN precisely. But it'll take some testing to be sure.

>         print 'ID=', row['id'], 'ListInRecord=', row['ListInRecord'][1:-1].split(', ')
>         # I want this to be a list of integers, NOT a list of strings
>
> OUTPUT:
> ID= 1 ListInRecord= ['2', '9']
> ID= 2 ListInRecord= ['1', '9']
> ID= 3 ListInRecord= ['2']
> ID= 9 ListInRecord= ['']

-- 
DaveA
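The deserialization step Dave describes might look like the following sketch (the helper name `deserialize` is mine, not from the thread; it handles the bracketed-list, plain-int, and fall-through cases he lists):

```python
# Values that look like "[...]" become lists of ints, digit strings
# (with an optional leading minus sign) become ints, anything else is
# returned unchanged.
def deserialize(value):
    value = value.strip()
    if value.startswith('[') and value.endswith(']'):
        inner = value[1:-1].strip()
        # An empty "[]" must become an empty list, not [int('')].
        return [int(s) for s in inner.split(',')] if inner else []
    if value.lstrip('-').isdigit():
        return int(value)
    return value

# One of the rows from the example above, as csv.DictReader returns it.
row = {'id': '9', 'type': '3', 'level': '0', 'ListInRecord': '[]'}
restored = {k: deserialize(v) for k, v in row.items()}

assert restored == {'id': 9, 'type': 3, 'level': 0, 'ListInRecord': []}
```

Note the empty-list case: the naive `[1:-1].split(', ')` in the original code turns `[]` into `['']`, which is exactly the bug visible in the OUTPUT for record9.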
Re: [Tutor] A CSV field is a list of integers - how to read it as such?
Don, Dave - Thanks for your help!

Don: Thanks! I've just browsed the AST documentation, much of it goes over my head, but the ast.literal_eval helper function works beautifully for me.

Dave: Again, thanks! Also, you asked "More space efficient than what?" I meant .csv versus dict, list, and objects. Specifically, if I read a 10-million row .csv file into RAM, how is its RAM footprint compared to a list or dict containing 10M equivalent items, or to 10M equivalent class instances living in RAM.

I've just tested and learned that a .csv file has very little overhead, in the order of bytes not KB. Presumably the same applies when the file is read into RAM. As to the RAM overheads of dict, list, and class instances, I've just found some stackoverflow discussions. One (http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory) says that for large lists in CPython, the overallocation is 12.5 percent.

Trung Doan

On Mon, Mar 4, 2013 at 2:12 PM, Dave Angel da...@davea.name wrote:
> [quoted text snipped]
Re: [Tutor] A CSV field is a list of integers - how to read it as such?
On 04/03/13 17:48, DoanVietTrungAtGmail wrote:
> Don, Dave - Thanks for your help!
>
> Don: Thanks! I've just browsed the AST documentation, much of it goes over
> my head, but the ast.literal_eval helper function works beautifully for me.
>
> Dave: Again, thanks! Also, you asked "More space efficient than what?" I
> meant .csv versus dict, list, and objects. Specifically, if I read a
> 10-million row .csv file into RAM, how is its RAM footprint compared to a
> list or dict containing 10M equivalent items, or to 10M equivalent class
> instances living in RAM.
>
> I've just tested and learned that a .csv file has very little overhead, in
> the order of bytes not KB. Presumably the same applies when the file is
> read into RAM.

How many items per row? How many characters per item? CSV files are just text files, so they'll take as much memory as they have characters, multiplied by the number of bytes per character, e.g.:

    ASCII or Latin-1: 1 byte per character
    UTF-16: 2 bytes per character
    UTF-32: 4 bytes per character
    UTF-8: variable; depends on the characters, but typically close to
           1 byte for Western-European text

Suppose you have CSV stored in UTF-16, 10-million rows, with 1 hundred columns per row, and each column averages 30 characters, giving approximately 6200 bytes per row, or 62 gigabytes in total. That's a pretty big file. Does your computer have 62 GB of memory? If not, you're going to have a bit of trouble reading in the entire file all at once... But if you process only one row at a time, you only have to handle about 6.2 KB per row at a time. When that gets converted into a list of strings, that will take about 24 KB.

> As to the RAM overheads of dict, list, and class instances, I've just
> found some stackoverflow discussions. One
> (http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory)
> says that for large lists in CPython, the overallocation is 12.5 percent.

Yes. Do you have a question about it?
-- 
Steven
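Steven's back-of-the-envelope per-row figure can be checked with a short sketch (an editor's illustration, not from the thread; Python 3, and the list-of-strings sizes are CPython-specific):

```python
import csv
import io
import sys

# One row shaped like Steven's estimate: 100 columns of 30 characters.
line = ','.join(['x' * 30] * 100)
row = next(csv.reader(io.StringIO(line)))   # the row as a list of 100 strings

# As UTF-16 text: 2 bytes per character, including the 99 commas.
text_bytes = len(line.encode('utf-16-le'))  # 3099 chars -> 6198 bytes

# As a list of strings: the list's pointers plus each string object.
list_bytes = sys.getsizeof(row) + sum(sys.getsizeof(s) for s in row)

print(text_bytes, list_bytes)
```

The text form comes out at about 6.2 KB, matching the estimate; the list-of-strings form costs several times the raw character data because of per-object overhead, which is the effect Steven's 24 KB figure describes.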