Efficient multi-slicing technique?

2009-01-25 Thread python
Is there an efficient way to multi-slice a fixed-width string into
individual fields that's logically equivalent to the way one
would slice a delimited string using .split()?
Background: I'm parsing some very large, fixed line-width text
files that have weekly columns of data (52 data columns plus
related data). My current strategy is to loop through a list of
slice()'s to build a list of the specific field values for each
line. This is fine for small files, but seems inefficient. I'm
hoping that there's a built-in (C based) or 3rd party module that
is specifically designed for doing multiple field extractions at
once.
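
For concreteness, a minimal sketch of that slice()-based strategy might
look like this (the field offsets below are made up):

  # Hypothetical field boundaries for one fixed-width record layout.
  FIELDS = [slice(0, 2), slice(2, 11), slice(11, 18), slice(18, 22)]

  for line in file('sample.txt'):
      values = [line[s] for s in FIELDS]
      # ... work with the extracted field values ...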
Thank you,
Malcolm


Re: Efficient multi-slicing technique?

2009-01-25 Thread MRAB

pyt...@bdurham.com wrote:
Is there an efficient way to multi-slice a fixed-width string into 
individual fields that's logically equivalent to the way one would slice 
a delimited string using .split()?


Background: I'm parsing some very large, fixed line-width text files 
that have weekly columns of data (52 data columns plus related data). My 
current strategy is to loop through a list of slice()'s to build a list 
of the specific field values for each line. This is fine for small 
files, but seems inefficient. I'm hoping that there's a built-in (C 
based) or 3rd party module that is specifically designed for doing 
multiple field extractions at once.



You could try the struct module:

>>> import struct
>>> struct.unpack('3s4s1s', b'123abcdX')
('123', 'abcd', 'X')



Re: Efficient multi-slicing technique?

2009-01-25 Thread Tim Chase

Is there an efficient way to multi-slice a fixed-width string
into individual fields that's logically equivalent to the way
one would slice a delimited string using .split()? Background:
I'm parsing some very large, fixed line-width text files that
have weekly columns of data (52 data columns plus related
data). My current strategy is to loop through a list of
slice()'s to build a list of the specific field values for
each line. This is fine for small files, but seems
inefficient. I'm hoping that there's a built-in (C based)


I'm not sure if it's more efficient, but there's the struct 
module[1]:


  from struct import unpack
  for line in file('sample.txt'):
      (num, a, b, c, nl) = unpack('2s9s7s4sc', line)
      print 'num:', repr(num)
      print 'a:', repr(a)
      print 'b:', repr(b)
      print 'c:', repr(c)

Adjust the format string for your data (the last 'c' is the 
newline character -- you might be able to use 'x' here to just 
ignore the byte so it doesn't get returned). The sample data I 
threw together was 2/9/7/4-character data.  The general pattern would be


  lengths = [3, 18, 24, 5, 1, 8]
  FORMAT_STR = (
      ''.join('%ss' % length for length in lengths) +
      'c')
  for line in file(INFILE):
      (f1, f2, ..., fn, _) = unpack(FORMAT_STR, line)
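
For example, a self-contained version of that pattern (with
hypothetical column widths) might be:

  import struct

  # Hypothetical widths; the trailing 'c' accounts for the newline.
  lengths = [3, 18, 24, 5, 1, 8]
  FORMAT_STR = ''.join('%ss' % length for length in lengths) + 'c'

  for line in file('sample.txt'):
      # struct.unpack() requires len(line) == struct.calcsize(FORMAT_STR)
      fields = struct.unpack(FORMAT_STR, line)[:-1]  # drop the newline byte
      print fields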


-tkc

[1]
http://docs.python.org/library/struct.html



Re: Efficient multi-slicing technique?

2009-01-25 Thread python
Tim,

 I'm not sure if it's more efficient, but there's the struct module:
 http://docs.python.org/library/struct.html

Thanks for your suggestion. I've been experimenting with this technique,
but my initial tests don't show any performance improvements over using
slice() objects to slice a string. However, I missed the nuance of using
'x' to mark filler bytes - I'm going to see if this makes a difference
(it may, as I am skipping over several columns of input that I've
currently been returning as ignored values).

Reading your linked doc ... wait ... it looks like I can 'compile'
struct format strings by using a Struct class vs. using the module's
basic unpack() function. This sounds like the difference between using
compiled regular expressions vs. re-compiling a regular expression on
every use. I'll see if this makes a difference and report back to the
list.
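
For reference, the precompiled form would be something like this (the
format string here is hypothetical):

  import struct

  # Compile the format once, much like re.compile() for regexes.
  record = struct.Struct('2s9s7s4sc')

  for line in file('sample.txt'):
      fields = record.unpack(line)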

Regards,
Malcolm



Re: Efficient multi-slicing technique?

2009-01-25 Thread Tim Chase

I'm not sure if it's more efficient, but there's the struct
module: http://docs.python.org/library/struct.html


Thanks for your suggestion. I've been experimenting with this
technique, but my initial tests don't show any performance
improvements over using slice() objects to slice a string.
However, I missed the nuance of using 'x' to mark filler bytes
- I'm going to see if this makes a difference (it may, as I am
skipping over several columns of input that I've currently been
returning as ignored values).


I don't expect it will make a great deal of difference -- there's
not much room to improve the process.  Are you actually
experiencing efficiency problems?  I regularly use slice
unpacking (without reaching for the struct module) with no
noteworthy performance impact beyond the cost of scanning the
file and doing the processing on those lines (and these are text
files several hundred megs in size).  When I omit my processing
code and just skim through the file, the difference between
slice-unpacking and not slice-unpacking is in the sub-second range.
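
If you want numbers for your own layout, a timeit comparison along
these lines would show the gap (this uses Python 2.6's timeit.timeit();
the slices and format string are hypothetical and just have to describe
the same record):

  import struct, timeit

  line = '12abcdefghi1234567abcd\n'  # one 23-byte sample record
  slices = [slice(0, 2), slice(2, 11), slice(11, 18), slice(18, 22)]
  compiled = struct.Struct('2s9s7s4sc')

  print timeit.timeit(lambda: [line[s] for s in slices])
  print timeit.timeit(lambda: compiled.unpack(line))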


Reading your linked doc ... wait ... it looks like I can
'compile' struct format strings by using a Struct class vs.
using the module's basic unpack() function. This sounds like
the difference between using compiled regular expressions vs.
re-compiling a regular expression on every use. I'll see if
this makes a difference and report back to the list.


I don't expect it will... in the code for the struct.py I've got
here in my 2.5 distribution, it maintains an internal cache of
compiled format strings, so unless you have more than _MAXCACHE=100
format strings, it's not something you really have to worry
about. (In my main data-processing/ETL app, I can't envision
having more than about 20 format strings, if I went that route.)

-tkc
