Currently I am faced with a large computation task which operates on a huge CSV file. As a test I am working on a very small subset, which already contains 2E6 records. The task allows the file to be split, as each computation only involves one line. The application performing the computation already exists, but it was never meant to run on such a big dataset.
One thing that is clear is that this will take a while to compute, so a distributed approach is probably a good idea. There are a couple of options for this:

Scenario A (file is split manually into smaller parts):
1) Fire up an openmosix/kerrighed cluster, and run one process for each file part.

Scenario B (file is "split" using the application itself):
2) Again with an openmosix/kerrighed cluster, but only one instance of the application is run, using parallelpython.
3) Using parallelpython without a cluster, but running ppserver.py on each node.

The second case looks most interesting, as it is quite flexible. In that case, however, I would need to address subsets of the CSV file, and the default csv.reader class does not allow random access to the file (or jumping to a specific line). What would be the most efficient way to subset a CSV file? For example:

    f1 = job_server.submit(calc_scores, datafile[0:1000])
    f2 = job_server.submit(calc_scores, datafile[1000:2000])
    f3 = job_server.submit(calc_scores, datafile[2000:3000])
    ... and so on

Obviously this won't work, as you cannot take a slice of a csv-file. Would it be possible to subclass the csv.reader class in a way that lets you access a slice somewhat efficiently? Jumping backwards is not really necessary, so it is not truly random access.

The obvious way is to do the following:

    buffer = []
    for line in reader:
        buffer.append(line)
        if len(buffer) == 1000:
            f = job_server.submit(calc_scores, buffer)
            buffer = []
    if buffer:  # submit whatever is left over
        f = job_server.submit(calc_scores, buffer)

but would this not kill my memory if I start loading bigger slices into the "buffer" variable?

-- 
http://mail.python.org/mailman/listinfo/python-list
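For what it's worth, here is one way the buffered approach above can be written so that memory use stays bounded: wrap the reader in a generator that yields fixed-size chunks via itertools.islice. Only one chunk is ever held in memory at a time, regardless of the total file size. The function name `read_in_chunks` and the chunk size are of course just placeholders; `job_server` and `calc_scores` are the names from the post.

```python
import csv
import itertools

def read_in_chunks(csvfile, chunk_size=1000):
    """Yield successive lists of up to chunk_size parsed rows.

    csv.reader is a forward-only iterator, so each call to islice
    consumes the next chunk_size rows; nothing is buffered beyond
    the current chunk.
    """
    reader = csv.reader(csvfile)
    while True:
        chunk = list(itertools.islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk
```

Usage would then look something like:

```python
with open("data.csv") as f:
    for chunk in read_in_chunks(f, 1000):
        job = job_server.submit(calc_scores, chunk)
```

This also answers the slice question implicitly: since jumping backwards is not needed, islice over the forward-only reader gives the effect of consecutive slices [0:1000], [1000:2000], ... without ever materialising the whole file.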