[email protected] wrote:
Hi,

I have an Excel file that is read into Python (8000 rows):

    from csv import reader, writer

    incsv = reader(open(MY_FILE), dialect='excel')
    keys = incsv.next()

There are mixed datatypes. The last column contains a cumulative
frequency running in order from 0.0000 to 1.0000 over the 8000 rows.
For a loop of 100,000 iterations I want to take a new random number
each time and find the row with the next biggest cumulative frequency
value. Here's my current (pseudo)code:

    for _ in range(100000):
        myRand = random.random()
        for line in incsv:
            if float(line[-1]) > myRand:
                resline = []
                for item in line:
                    try:
                        i = int(item)
                    except ValueError:
                        try:
                            i = float(item)
                        except ValueError:
                            i = item
                    resline.append(i)
                # Here we construct a dict of pair values:
                # {'ID': 18, ...}
                res = dict(zip(keys, resline))
                break
            else:
                continue
        # do some stuff with res

I'm scanning over each line of the csv and deciding which row to
select, 100k times:

    for line in incsv:
        if float(line[-1]) > random.random():

This is just not very efficient. How can I improve this code? I can
use numpy etc., whatever.
Here's a suggestion: construct the dicts for all the rows once,
stored in a list. Construct a list of just the cumulative
frequencies. For each random value, use the bisect module to search
for the value in the cumulative frequencies list. This will return an
index which you can then use on your list of dicts.
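
A minimal sketch of that approach (untested; MY_FILE and the CSV
layout come from your post, and I'm assuming the last column really
is the cumulative frequency, sorted ascending):

    import csv
    import random
    from bisect import bisect_right

    def convert(item):
        # Same int -> float -> string fallback as your original loop.
        for cast in (int, float):
            try:
                return cast(item)
            except ValueError:
                pass
        return item

    incsv = csv.reader(open(MY_FILE), dialect='excel')
    keys = next(incsv)

    # Build all the row dicts once, up front.
    rows = [dict(zip(keys, [convert(item) for item in line]))
            for line in incsv]

    # The last column is already sorted from 0.0000 to 1.0000,
    # so it can be binary-searched directly.
    cumfreqs = [float(row[keys[-1]]) for row in rows]

    for _ in range(100000):
        myRand = random.random()
        # Index of the first row whose cumulative frequency is
        # strictly greater than myRand (matches your > test).
        idx = bisect_right(cumfreqs, myRand)
        res = rows[idx]
        # do some stuff with res

This reads the file once instead of rescanning it, and each of the
100,000 lookups is a binary search over 8000 values rather than a
linear scan.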
