Re: Efficient multi-slicing technique?
>> I'm not sure if it's more efficient, but there's the struct
>> module: http://docs.python.org/library/struct.html
>
> Thanks for your suggestion. I've been experimenting with this
> technique, but my initial tests don't show any performance
> improvements over using slice() objects to slice a string.
> However, I missed the nuance of using 'x' to mark filler bytes -
> I'm going to see if this makes a difference (it may, as I am
> skipping over several columns of input that I've currently been
> returning as ignored values).

I don't expect it will make a great deal of difference -- there's not much room to improve the process. Are you actually experiencing efficiency problems? I regularly use slice unpacking (without reaching for the struct module) with no noteworthy performance impact beyond the cost of scanning the file and doing the processing on those lines (and these are text files several hundred megs in size). When I omit my processing code and just skim through the file, the difference between slice-unpacking and not slice-unpacking is in the sub-second range.

> Wait ... it looks like I can 'compile' struct strings by using a
> Struct class vs. using the module's basic unpack() function. This
> sounds like the difference between using compiled regular
> expressions vs. re-compiling a regular expression on every use.
> I'll see if this makes a difference and report back to the list.

I don't expect it will... in the struct.py I've got here in my 2.5 distribution, it maintains an internal cache of compiled format strings, so unless you have more than _MAXCACHE=100 format strings, it's not something you really have to worry about. (In my main data-processing/ETL app, I can't envision having more than about 20 format strings, if I went that route.)

-tkc

--
http://mail.python.org/mailman/listinfo/python-list
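A minimal sketch of the "compiled" struct approach discussed above: struct.Struct(fmt) parses the format string once, so its bound unpack() method skips re-parsing on every call (the same work the module-level cache does behind the scenes). The field widths and sample line here are invented for illustration.

```python
# Precompile the format once, reuse it per line.
import struct

fmt = "2s9s7s4s"                 # four string fields: 2, 9, 7 and 4 bytes wide
compiled = struct.Struct(fmt)    # parsed once, not per call

line = b"01Johnson  1234.5602/7"
fields = compiled.unpack(line)   # same result as struct.unpack(fmt, line)
print(fields)
```

Note that unpack() requires the buffer length to match struct.calcsize(fmt) exactly, which is what makes it a natural fit for fixed-width records.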
Re: Efficient multi-slicing technique?
Tim,

> I'm not sure if it's more efficient, but there's the struct
> module: http://docs.python.org/library/struct.html

Thanks for your suggestion. I've been experimenting with this technique, but my initial tests don't show any performance improvements over using slice() objects to slice a string. However, I missed the nuance of using 'x' to mark filler bytes - I'm going to see if this makes a difference (it may, as I am skipping over several columns of input that I've currently been returning as ignored values).

Wait ... it looks like I can 'compile' struct strings by using a Struct class vs. using the module's basic unpack() function. This sounds like the difference between using compiled regular expressions vs. re-compiling a regular expression on every use. I'll see if this makes a difference and report back to the list.

Regards,
Malcolm
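The 'x' pad-byte idea mentioned above can be sketched like this: columns you don't care about are marked 'x' in the format string, so unpack() skips them instead of returning throwaway values you then ignore. The column layout is hypothetical.

```python
# 'x' marks pad bytes: skipped columns never come back from unpack().
import struct

# Layout: 2-byte id, 9 bytes ignored, 7-byte amount, 4 bytes ignored.
line = b"01Johnson  1234.5602/7"
fields = struct.unpack("2s9x7s4x", line)
print(fields)   # only the two fields actually requested
```

Whether this is faster than unpacking everything and discarding the unwanted values is exactly the question being tested here; at minimum it avoids constructing the ignored strings.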
Re: Efficient multi-slicing technique?
> Is there an efficient way to multi-slice a fixed-width string
> into individual fields that's logically equivalent to the way one
> would slice a delimited string using .split()? Background: I'm
> parsing some very large, fixed line-width text files that have
> weekly columns of data (52 data columns plus related data). My
> current strategy is to loop through a list of slice()'s to build
> a list of the specific field values for each line. This is fine
> for small files, but seems inefficient. I'm hoping that there's a
> built-in (C based)

I'm not sure if it's more efficient, but there's the struct module[1]:

    from struct import unpack
    for line in file('sample.txt'):
        (num, a, b, c, nl) = unpack("2s9s7s4sc", line)
        print "num:", repr(num)
        print "a:", repr(a)
        print "b:", repr(b)
        print "c:", repr(c)

Adjust the format string for your data (the last "c" is the newline character -- you might be able to use "x" here to just ignore the byte so it doesn't get returned). The sample data I threw together was 2/9/7/4 character data. The general pattern would be

    lengths = [3, 18, 24, 5, 1, 8]
    FORMAT_STR = (
        ''.join("%ss" % length for length in lengths)
        + 'c')
    for line in file(INFILE):
        (f1, f2, ..., fn, _) = unpack(FORMAT_STR, line)

-tkc

[1] http://docs.python.org/library/struct.html
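The general pattern above, updated to run on current Python (bytes input, print as a function) and using a trailing 'x' to swallow the newline as suggested, might look like this. The widths and sample records are invented.

```python
# Build the struct format string from a list of field widths,
# with a trailing 'x' so the newline byte is skipped, not returned.
import struct

lengths = [2, 9, 7, 4]
FORMAT_STR = ''.join("%ss" % n for n in lengths) + 'x'   # -> "2s9s7s4sx"

lines = [b"01Johnson  1234.5602/7\n",
         b"02Smith    0042.9903/1\n"]
for raw in lines:
    fields = struct.unpack(FORMAT_STR, raw)
    print(fields)
```

In real use the loop would read from the file object directly; each line must be exactly struct.calcsize(FORMAT_STR) bytes long, newline included.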
Re: Efficient multi-slicing technique?
pyt...@bdurham.com wrote:

> Is there an efficient way to multi-slice a fixed-width string into
> individual fields that's logically equivalent to the way one would
> slice a delimited string using .split()? Background: I'm parsing
> some very large, fixed line-width text files that have weekly
> columns of data (52 data columns plus related data). My current
> strategy is to loop through a list of slice()'s to build a list of
> the specific field values for each line. This is fine for small
> files, but seems inefficient. I'm hoping that there's a built-in
> (C based) or 3rd party module that is specifically designed for
> doing multiple field extractions at once.

You could try the struct module:

    >>> import struct
    >>> struct.unpack("3s4s1s", b"123abcdX")
    ('123', 'abcd', 'X')
Efficient multi-slicing technique?
Is there an efficient way to multi-slice a fixed-width string into individual fields that's logically equivalent to the way one would slice a delimited string using .split()?

Background: I'm parsing some very large, fixed line-width text files that have weekly columns of data (52 data columns plus related data). My current strategy is to loop through a list of slice()'s to build a list of the specific field values for each line. This is fine for small files, but seems inefficient.

I'm hoping that there's a built-in (C based) or 3rd party module that is specifically designed for doing multiple field extractions at once.

Thank you,
Malcolm
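For reference, the slice-list strategy described above might be sketched as follows -- precompute one slice() object per column, then apply the list to each line. The field widths and sample record are made up.

```python
# Precompute slice() objects from a list of field widths, then
# apply them to each fixed-width line.
widths = [2, 9, 7, 4]            # hypothetical column widths
slices = []
offset = 0
for w in widths:
    slices.append(slice(offset, offset + w))
    offset += w

def parse_line(line, slices=slices):
    """Return the list of column values for one fixed-width line."""
    return [line[s] for s in slices]

print(parse_line("01Johnson  1234.5602/7"))
```

The slices are built once, so the per-line cost is just the list comprehension; this is the baseline the struct-based suggestions in the thread are being measured against.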