On Mar 8, 9:06 am, per <perfr...@gmail.com> wrote:
> hi all,
>
> i have a program that essentially loops through a text file that's
> about 800 MB in size containing tab separated data... my program
> parses this file and stores its fields in a dictionary of lists.
>
> for line in file:
>     split_values = line.strip().split('\t')
line.strip() is NOT a very good idea, because it strips all whitespace
including tabs, so an empty last field silently disappears.
line.rstrip('\n') is sufficient.

BUT as Skip has pointed out, you should be using the csv module anyway.

An 800 MB file is unlikely to have been written by Excel, but bear in
mind that Excel has this stupid idea of wrapping quotes around fields
that contain commas (and quotes) even when the field delimiter is NOT
a comma.

Experiment: open Excel, enter the following 4 strings in cells A1:D1

    normal
    embedded,comma
    embedded"Hello"quote
    normality returns

then save as Text (Tab-delimited). Here's what you get:

| >>> open('Excel_tab_delimited.txt', 'rb').read()
| 'normal\t"embedded,comma"\t"embedded""Hello""quote"\tnormality returns\r\n'
| >>>

> # do stuff with split_values
>
> currently, this is very slow in python, even if all i do is break up
> each line using split() and store its values in a dictionary, indexing
> by one of the tab separated values in the file.
>
> is this just an overhead of python that's inevitable? do you guys
> think that switching to cython might speed this up, perhaps by
> optimizing the main for loop? or is this not a viable option?

You are unlikely to get much speed-up that way; I'd expect the loop
overhead to be a tiny part of the execution time.
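
To see why the strip() matters, here's a quick illustration with a
made-up row whose last field is empty; strip() swallows the trailing
tab, so the field count changes:

    line = 'abc\tdef\t\n'   # made-up row: three fields, the last one empty

    print(line.strip().split('\t'))       # ['abc', 'def']      (empty last field lost)
    print(line.rstrip('\n').split('\t'))  # ['abc', 'def', '']  (field count preserved)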
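
And a rough sketch of the csv approach for your task (untested; the
file name and the choice of key column are made up, adjust to suit).
The stock 'excel-tab' dialect handles both the tab delimiter and the
quote-doubling shown above, and the actual field splitting happens in
C, so the Python-level loop is unlikely to be where the time goes:

    import csv
    from collections import defaultdict

    data = defaultdict(list)    # key value -> list of rows with that key

    f = open('big_file.txt', 'rb')               # binary mode for the csv module
    for row in csv.reader(f, dialect='excel-tab'):
        data[row[0]].append(row)                 # keying on the first column here
    f.close()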