Re: [Tutor] Reading binary files #2
etrade.griffi...@dsl.pipex.com wrote:

> I have attached an example of the file in ASCII format and the
> equivalent unformatted version.

Comparing them in vim... It doesn't look too bad except for the DATABEGI / DATAEND message format. That could be tricky to unravel but we have no clear format for MESS. But I assume that all the stuff between BEG and END is supposed to be effectively nested?

> it gets to a data item that has no additional associated data, then
> seems to have got 4 bytes ahead of itself.

You are creating a format string of 0d but I'm not sure how struct behaves with zero lengths...

HTH,

Alan G.

==

# Test function to write/read from unformatted files

import sys
import struct

# Read file in one go
in_file = open("test.bin", "rb")
data = in_file.read()
in_file.close()

# Initialise
nrec = len(data)
stop = 0
items = []

# Read data until EOF encountered
while stop < nrec:

    # extract data structure
    start, stop = stop, stop + struct.calcsize('4s8si4s8s')
    vals = struct.unpack('4s8si4s8s', data[start:stop])
    items.extend(vals)
    print stop, vals

    # define format of subsequent data
    nval = int(vals[2])

    if vals[3] == 'INTE':
        fmt_string = 'i'
    elif vals[3] == 'CHAR':
        fmt_string = '8s'
    elif vals[3] == 'LOGI':
        fmt_string = 'i'
    elif vals[3] == 'REAL':
        fmt_string = 'f'
    elif vals[3] == 'DOUB':
        fmt_string = 'd'
    elif vals[3] == 'MESS':
        fmt_string = '%dd' % nval
    else:
        print "Unknown data type ... exiting"
        print items
        sys.exit(0)

    # extract data
    for i in range(0,nval):
        start, stop = stop, stop + struct.calcsize(fmt_string)
        vals = struct.unpack(fmt_string, data[start:stop])
        items.extend(vals)

    # trailing spaces
    if nval > 0:
        start, stop = stop, stop + struct.calcsize('4s')
        vals = struct.unpack('4s', data[start:stop])

# All data read so print items
print items

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Reading binary files #2
etrade.griffi...@dsl.pipex.com wrote:

> Hi
>
> following last week's discussion with Bob Gailer about reading unformatted FORTRAN files, I have attached an example of the file in ASCII format and the equivalent unformatted version.

Thank you. It is good to have real data to work with.

> Below is some code that works OK until it gets to a data item that has no additional associated data, then seems to have got 4 bytes ahead of itself.

Thank you. It is good to have real code to work with.

> I thought I had trapped this but it appears not. I think the issue is associated with newline characters or the unformatted equivalent.

I think not, but we will see.

I fail to see where the problem is. The data printed below seems to agree with the files you sent. What am I missing?

FWIW a few observations re coding style and techniques.

1) put the formats in a dictionary before the while loop:
   formats = {'INTE': 'i', 'CHAR': '8s', 'LOGI': 'i', 'REAL': 'f', 'DOUB': 'd', 'MESS': 'd'}

2) retrieve the format in the while loop from the dictionary:
   format = formats[vals[3]]

3) condense the 3 infile lines:
   data = open('test.bin', 'rb').read()

4) nrec is a misleading name (to me it means # of records), nbytes would be better.

5) Be consistent with the format between calcsize and unpack:
   struct.calcsize('4s8si4s8s')

6) Use meaningful variable names instead of val for the unpacked data:
   blank, name, length, typ = struct.unpack ... etc

7) The format for MESS should be 'd' rather than '%dd' % nval. When nval is 0 the for loop will make 0 cycles.

8) You don't have a format for DATA (BEGI); therefore the prior format (for CHAR) is being applied. The formats are the same so it does not matter but could be confusing later.
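Observations 1, 2 and 7 can be sketched roughly as follows. This is only an illustration written for Python 3: the helper name `item_format` and the big-endian `>` prefix are assumptions, not something taken from the attached files, and the MESS entry simply follows item 7 and may well need changing once the real layout is known.

```python
import struct

# Item type -> struct code for ONE value (MESS follows item 7; it is a guess)
FORMATS = {"INTE": "i", "CHAR": "8s", "LOGI": "i",
           "REAL": "f", "DOUB": "d", "MESS": "d"}


def item_format(typ, nval, byte_order=">"):
    """Build a struct format for nval values of the given type.

    byte_order=">" (big-endian) is an assumption; adjust to suit the file.
    """
    code = FORMATS.get(typ)
    if code is None:
        raise ValueError("Unknown data type %r" % (typ,))
    # Repeat the code rather than prefixing a count, because "38s" would
    # mean one 38-byte string, not three 8-byte strings.
    return byte_order + code * nval


fmt = item_format("REAL", 3)
print(fmt, struct.calcsize(fmt))
print(struct.calcsize(item_format("MESS", 0)))  # an empty item needs no bytes
```

Building the whole format once also removes the need for the inner for loop, since a single unpack call returns all nval values.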
-- 
Bob Gailer
Chapel Hill NC
919-636-4239
Re: [Tutor] Reading binary files #2
Hi Bob

some replies below. One thing I noticed with the full file was that I ran into problems when the number of records was 10500, and the file read got misaligned. Presumably 10500 is still within the range of int?

Best regards

Alun

At 17:49 09/02/2009, bob gailer wrote:

> etrade.griffi...@dsl.pipex.com wrote:
>> Hi
>>
>> following last week's discussion with Bob Gailer about reading unformatted FORTRAN files, I have attached an example of the file in ASCII format and the equivalent unformatted version.
>
> Thank you. It is good to have real data to work with.
>
>> Below is some code that works OK until it gets to a data item that has no additional associated data, then seems to have got 4 bytes ahead of itself.
>
> Thank you. It is good to have real code to work with.
>
>> I thought I had trapped this but it appears not. I think the issue is associated with newline characters or the unformatted equivalent.
>
> I think not, but we will see.
>
> I fail to see where the problem is. The data printed below seems to agree with the files you sent. What am I missing?

When I run the program it exits in the middle but should run through to the end. The output to the console was

236 ('\x00\x00\x00\x10', 'DATABEGI', 0, 'MESS', '\x00\x00\x00\x10\x00\x00\x00\x10')
264 ('TIME', '\x00\x00\x00\x01', 1380270412, '\x00\x00\x00\x10', '\x00\x00\x00\x04\x00\x00\x00\x00')

Here TIME is in vals[0] when it should be in vals[1] and so on. I found the problem earlier today and I re-wrote the main loop as follows (before I saw your helpful coding style comments):

while stop < nrec:

    # extract data structure
    start, stop = stop, stop + struct.calcsize('4s8si4s4s')
    vals = struct.unpack('4s8si4s4s', data[start:stop])
    items.extend(vals[1:4])
    print stop, vals

    # define format of subsequent data
    nval = int(vals[2])

    if vals[3] == 'INTE':
        fmt_string = 'i'
    elif vals[3] == 'CHAR':
        fmt_string = '8s'
    elif vals[3] == 'LOGI':
        fmt_string = 'i'
    elif vals[3] == 'REAL':
        fmt_string = 'f'
    elif vals[3] == 'DOUB':
        fmt_string = 'd'
    elif vals[3] == 'MESS':
        fmt_string = '%ds' % nval
    else:
        print "Unknown data type ... exiting"
        print items[-40:]
        sys.exit(0)

    # leading spaces
    if nval > 0:
        start, stop = stop, stop + struct.calcsize('4s')
        vals = struct.unpack('4s', data[start:stop])

    # extract data
    for i in range(0,nval):
        start, stop = stop, stop + struct.calcsize(fmt_string)
        vals = struct.unpack(fmt_string, data[start:stop])
        items.extend(vals)

    # trailing spaces
    if nval > 0:
        start, stop = stop, stop + struct.calcsize('4s')
        vals = struct.unpack('4s', data[start:stop])

Now I get this output

232 ('\x00\x00\x00\x10', 'DATABEGI', 0, 'MESS', '\x00\x00\x00\x10')
256 ('\x00\x00\x00\x10', 'TIME', 1, 'REAL', '\x00\x00\x00\x10')

and the script runs to the end

> FWIW a few observations re coding style and techniques.
>
> 1) put the formats in a dictionary before the while loop:
>    formats = {'INTE': 'i', 'CHAR': '8s', 'LOGI': 'i', 'REAL': 'f', 'DOUB': 'd', 'MESS': 'd'}
>
> 2) retrieve the format in the while loop from the dictionary:
>    format = formats[vals[3]]

Neat!!

> 3) condense the 3 infile lines:
>    data = open('test.bin', 'rb').read()

I still don't quite trust myself to chain functions together, but I guess that's lack of practice

> 4) nrec is a misleading name (to me it means # of records), nbytes would be better.

Agreed

> 5) Be consistent with the format between calcsize and unpack:
>    struct.calcsize('4s8si4s8s')
>
> 6) Use meaningful variable names instead of val for the unpacked data:
>    blank, name, length, typ = struct.unpack ... etc

Will do

> 7) The format for MESS should be 'd' rather than '%dd' % nval. When nval is 0 the for loop will make 0 cycles.

Wasn't sure about that one. MESS implies string but I wasn't sure what to do about a zero-length string

> 8) You don't have a format for DATA (BEGI); therefore the prior format (for CHAR) is being applied. The formats are the same so it does not matter but could be confusing later.

DATABEGI should be a keyword to indicate the start of the proper data which has format MESS (ie string). You did make me look again at the MESS format and it should be '%ds' % nval and not '%dd' % nval
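One reading of the console output above is that the 4-byte values on either side of each header ('\x00\x00\x00\x10' = 16 = 8 + 4 + 4) are record-length markers, which many FORTRAN compilers write around every unformatted sequential record. The Python 3 sketch below assumes that interpretation, plus 4-byte big-endian markers; `read_records` and `wrap` are hypothetical helper names, and the sample bytes are synthetic rather than taken from the real file.

```python
import struct


def read_records(data, marker_fmt=">i"):
    """Yield the payload of each marker-delimited record in data.

    Assumes every record is stored as <length><payload><length>.
    """
    size = struct.calcsize(marker_fmt)
    pos = 0
    while pos < len(data):
        (nbytes,) = struct.unpack_from(marker_fmt, data, pos)
        payload = data[pos + size:pos + size + nbytes]
        (trailer,) = struct.unpack_from(marker_fmt, data, pos + size + nbytes)
        if trailer != nbytes:
            raise ValueError("record markers disagree at offset %d" % pos)
        yield payload
        pos += size + nbytes + size


def wrap(payload):
    """Surround payload with leading and trailing length markers."""
    marker = struct.pack(">i", len(payload))
    return marker + payload + marker


# Synthetic stand-in for the real file: a TIME header, then one REAL value
sample = (wrap(struct.pack(">8si4s", b"TIME    ", 1, b"REAL"))
          + wrap(struct.pack(">f", 0.0)))

records = list(read_records(sample))
name, nval, typ = struct.unpack(">8si4s", records[0])
values = struct.unpack(">%df" % nval, records[1])
print(name.strip(), nval, typ, values)
```

Reading by marker rather than by a fixed header format means the loop does not need to know in advance whether a data record follows a header, which is where the original version drifted out of step.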
Re: [Tutor] reading binary files
eShopping wrote: Bob I am trying to read UNFORMATTED files. The files also occur as formatted files and the format string I provided is the string used to write the formatted version. I can read the formatted version OK. I (naively) assumed that the same format string was used for both files, the only differences being whether the FORTRAN WRITE statement indicated unformatted or formatted. WRITE UNFORMATTED dump memory to disk with no formatting. That is why we must do some analysis of the file to see where the data has been placed, how long the floats are, and what endian is being used. I'd like to examine the file myself. We might save a lot of time and energy that way. If it is not very large would you attach it to your reply. If it is very large you could either copy just the first 1000 or so bytes, or send the whole thing thru www.yousendit.com. At 21:41 03/02/2009, bob gailer wrote: First question: are you trying to work with the file written UNFORMATTED? If so read on. Well, did you read on? What reactions do you have? eShopping wrote: Data format: TIME 1 F 0.0 DISTANCE 10 F 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 F=float, D=double, L=logical, S=string etc The first part of the file should contain a string (eg TIME), an integer (1) and another string (eg F) so I tried using import struct in_file = open(file_name+.dat,rb) data = in_file.read() items = struct.unpack('sds', data) Now I get the error error: unpack requires a string argument of length 17 which has left me completely baffled! Did you open the file with mode 'b'? If not change that. You are passing the entire file to unpack when you should be giving it only the first line. That's why is is complaining about the length. We need to figure out the lengths of the lines. Consider the first line TIME 1 F 0.0 There were (I assume) 4 FORTRAN variables written here: character integer character float. 
Without knowing the lengths of the character variables we are at a loss as to what the struct format should be. Do you know their lengths? Is the last float or double? Try this: print data[:40] You should see something like: TIME...\x01\x00\x00\x00...F...\x00\x00\x00\x00...DISTANCE...\n\x00\x00\x00 where ... means 0 or more intervening stuff. It might be that the \x01 and the \n are in other places, as we also have to deal with byte order issues. Please do this and report back your results. And also the FORTRAN variable types if you have access to them. Apologies if this is getting a bit messy but the files are at a remote location and I forgot to bring copies home. I don't have access to the original FORTRAN program so I tried to emulate the reading the data using the Python script below. AFAIK the FORTRAN format line for the header is (1X, 1X, A8, 1X, 1X, I6, 1X, 1X, A1). If the data following is a float it is written using n(1X, F6.2) where n is the number of records picked up from the preceding header. # test program to read binary data import struct # create dummy data data = [] for i in range(0,10): data.append(float(i)) # write data to binary file b_file = open(test.bin,wb) b_file.write( %8s %6d %1s\n % (DISTANCE, len(data), F)) for x in data: b_file.write( %6.2f % x) You are still confusing text vs binary. The above writes text regardless of the file mode. If the FORTRAN file was written UNFORMATTED then you are NOT emulating that with the above program. The character data is read back in just fine, since there is no translation involved in the writing nor in the reading. The integer len(data) is being written as its text (character) representation (translating binary to text) but being read back in without translation. Also all the floating point data is going out as text. 
The file looks like (where b = blank) (how it would look in notepad):

bbDISTANCEbbbbbb10bbF\nbbb0.00bbb1.00bbb2.00

If you analyze this with 2s8s2si2s1s you will see 2s matches bb, 8s matches DISTANCE, 2s matches bb, i matches the next four blanks (\x20\x20\x20\x20). The i tells unpack to shove those 4 bytes unaltered into a Python integer, resulting in 538976288. You can verify that:

>>> struct.unpack('i', '    ')
(538976288,)

Please either assure me you understand or are prepared for a more in depth tutorial.

> b_file.close()
>
> # read back data from file
> c_file = open("test.bin", "rb")
> data = c_file.read()
> start, stop = 0, struct.calcsize("2s8s2si2s1s")
> items = struct.unpack("2s8s2si2s1s", data[start:stop])
> print items
> print data[:40]
>
> I'm pretty sure that when I tried this at the other PC there were a bunch of \x00\x00 characters in the file but they don't appear in NotePad ... anyway, I thought the Python above would unpack the data but items appears as
>
> ('  ', 'DISTANCE', '  ', 538976288, '10', ' ')
>
> which seems to contain an extra item (538976288)
>
> Alun Griffiths

-- 
Bob Gailer
Chapel Hill NC
919-636-4239
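The mysterious 538976288 can be reproduced in a couple of lines. This is a small Python 3 illustration (so the input is a bytes literal); the variable names are made up for the example.

```python
import struct

four_blanks = b"    "             # four ASCII spaces, 0x20 each
(as_int,) = struct.unpack("<i", four_blanks)
print(as_int)                     # 538976288
print(hex(as_int))                # 0x20202020

# A count written as text has to be converted as text instead:
text_count = b"    10"            # what "%6d" % 10 produces
print(int(text_count.decode("ascii")))
```

Because all four bytes are identical, the byte order makes no difference here; big-endian and little-endian both give the same integer.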
Re: [Tutor] reading binary files
Bob

sorry, I misread your email and thought it said read on if the file was FORMATTED. It wasn't so I didn't (but should have). I read the complete thread and it is getting a little messy so I have extracted your questions and added some answers.

> I'd like to examine the file myself. We might save a lot of time and energy that way. If it is not very large would you attach it to your reply. If it is very large you could either copy just the first 1000 or so bytes, or send the whole thing thru www.yousendit.com.

The file is around 800 Mb but I can't get hold of it until next week so suggest starting a new topic once I have a cut-down copy.

> Well, did you read on? What reactions do you have?

I did (finally) read on and I am still a little confused, though less than before. I guess the word UNFORMATTED means that the file has no format though it presumably has some structure? One major hurdle is that I am not really sure about the difference between a Python binary file and a FORTRAN UNFORMATTED file so any pointers would be gratefully received

> The file looks like (where b = blank) (how it would look in notepad):
>
> bbDISTANCEbbbbbb10bbF\nbbb0.00bbb1.00bbb2.00
>
> If you analyze this with 2s8s2si2s1s you will see 2s matches bb, 8s matches DISTANCE, 2s matches bb, i matches the next four blanks (\x20\x20\x20\x20). The i tells unpack to shove those 4 bytes unaltered into a Python integer, resulting in 538976288. You can verify that:
>
> >>> struct.unpack('i', '    ')
> (538976288,)
>
> Please either assure me you understand or are prepared for a more in depth tutorial.

I now understand why Python gave me the results it did ... it looks like reading the FORTRAN file will be a non-trivial task so probably best to wait until I can post a copy of it.

Thanks for your help

Alun Griffiths
Re: [Tutor] reading binary files
eShopping etrade.griffi...@dsl.pipex.com wrote

> I now understand why Python gave me the results it did ... it looks like reading the FORTRAN file will be a non-trivial task so probably best to wait until I can post a copy of it.

You don't say which OS you are on but you can read the binary file into a hex editor and see the structure. If you are on *nix you can use od -x and if on Windows run debug and use the d command to display the file as hex.

Using that you should be able to determine whether fields are fixed length or delimited by a particular character or tagged with a length prefix etc.

HTH,

-- 
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld
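If no hex editor is to hand, a rough equivalent of od or debug can be put together in Python itself. The function below is only a sketch in Python 3; `hexdump` is a made-up name and the sample bytes are invented to resemble the header seen elsewhere in this thread.

```python
def hexdump(data, width=16):
    """Return an offset / hex / printable-text listing of a bytes object."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join("%02x" % b for b in chunk)
        # Show printable ASCII as itself and everything else as a dot
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append("%08x  %-*s  %s" % (offset, width * 3 - 1, hex_part, text_part))
    return "\n".join(lines)


sample = b"\x00\x00\x00\x10DATABEGI\x00\x00\x00\x00MESS\x00\x00\x00\x10"
print(hexdump(sample))
```

For a real file, something like hexdump(open("test.bin", "rb").read(64)) would show the first 64 bytes, which is usually enough to spot length prefixes and fixed-width names.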
Re: [Tutor] reading binary files
eShopping wrote:

> The file is around 800 Mb but I can't get hold of it until next week so suggest starting a new topic once I have a cut-down copy.

OK will wait with bated breath.

>> Well, did you read on? What reactions do you have?
>
> I did (finally) read on and I am still a little confused, though less than before. I guess the word UNFORMATTED means that the file has no format

Depends on what you mean by format. When you use % formatting in Python it is the same thing as a FORMATTED WRITE in FORTRAN: a set of directives that direct the translation of data to human readable text.

Files per se are a sequence of bytes. As such they have no format. When we examine a file we attempt to make sense of the bytes. Some of the bytes may represent ASCII printable characters, others not. The body of this email is a sequence of ASCII printable characters that make sense to you when you read them.

The file written UNFORMATTED has some ASCII printable characters that you can read (e.g. DISTANCE), some that you can recognize as letters, numbers, etc but are not English words, and non-printable characters that show up as garbage symbols or not at all. Those that are not readable are the internal representation of numbers.

> though it presumably has some structure? One major hurdle is that I am not really sure about the difference between a Python binary file and a FORTRAN UNFORMATTED file so any pointers would be gratefully received

There is no such thing as a Python binary file. When you open a file with mode 'b' you are asking the file system to ignore line-ends. If you do not specify 'b' then the file system translates line-ends into \n when reading and translates \n back to line-ends when writing.

The reason for this is that different OS file systems use different codes for line-ends. By translating them to and from \n the Python program becomes OS independent.

Windows uses ctrl-M ctrl-J (carriage return - line feed; \x0d\x0a).
Linux/Unix uses ctrl-J (line feed; \x0a).
Mac uses ctrl-M (carriage return; \x0d).
Python uniformly translates these to \n (\x0a).

When processing files written without line-ends (e.g. UNFORMATTED) there may be line-end characters or sequences that must NOT be treated as line-ends. Hence mode 'b'.

Example (on Windows):

>>> x = open('x', 'w')   # write normal, allowing \n to be translated to the OS line end
>>> x.write("Hello\n")
>>> x.close()
>>> x = open('x', 'rb')  # read binary, avoiding translation
>>> x.read()
'Hello\r\n'

where \r = \x0d

-- 
Bob Gailer
Chapel Hill NC
919-636-4239
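The same experiment can be repeated on any OS in Python 3, where open() takes a newline argument. The sketch below uses newline="\r\n" to force the Windows-style translation rather than relying on the platform default; the temporary path is just for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "x")

# Text mode: each "\n" written is translated to the requested line end
with open(path, "w", newline="\r\n") as f:
    f.write("Hello\n")

# Binary mode: the bytes come back exactly as stored on disk
with open(path, "rb") as f:
    raw = f.read()
print(raw)          # b'Hello\r\n'

# Text mode again: the line end is translated back to "\n"
with open(path, "r") as f:
    text = f.read()
print(repr(text))   # 'Hello\n'
```

For an UNFORMATTED file this translation is exactly what has to be avoided, since a byte pair that happens to equal \x0d\x0a inside a number would otherwise be altered.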
Re: [Tutor] reading binary files
Sorry, still having problems I am trying to read data from a file that has format item_name num_items item_type items eg TIME 1 0.0 DISTANCE 10 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 Where is the item_type? Ooops, the data format should look like this: TIME 1 F 0.0 DISTANCE 10 F 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 F=float, D=double, L=logical, S=string etc I can read this if the data are in ASCII format using in_file = open(my_file.dat,r) data1 = in_file.read() tokens = data1.split() It might be easier to process line by line using readline or readlines rather than read but otherwise, ok so far... and then stepping through the resulting list but the data also appear in the same format in a binary file. When you say a binary file do you mean an ASCII file encoded into binary using some standard algorithm? Or do you mean the data is binary so that, for example, the number 1 would appear as 4 bytes? If so do you know how strings (the name) are delimited? Also how many could be present - is length a single or multiple bytes? and are the reors fixed length or variable? If variable what is the field/record separator? Sorry, no idea what the difference is. All I know is that the data were written by a FORTRAN program using the UNFORMATTED argument in the WRITE statement and that if they had been written FORMATTED then we would get afile that looks something like the example above You may need to load the file into a hex editor of debugger to determine the answers... Having done that the struct module will allow you to read the data. You can see a basic example of using struct in my tutorial topic about handling files. The first part of the file should contain a string (eg TIME), an integer (1) and another string (eg F) so I tried using import struct in_file = open(file_name+.dat,rb) data = in_file.read() items = struct.unpack('sds', data) Now I get the error error: unpack requires a string argument of length 17 which has left me completely baffled! 
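For the ASCII version, the "split into tokens and step through the list" approach described above might look like the following Python 3 sketch. The function name `parse_items` is made up, and only the F and D types are converted; how L and S values appear in the text file is not shown in this thread, so they are left as raw strings.

```python
def parse_items(text):
    """Parse 'name count type values...' groups from whitespace-separated text."""
    tokens = text.split()
    items = {}
    pos = 0
    while pos < len(tokens):
        name = tokens[pos]
        count = int(tokens[pos + 1])
        typ = tokens[pos + 2]
        raw = tokens[pos + 3:pos + 3 + count]
        if typ in ("F", "D"):
            values = [float(t) for t in raw]
        else:
            values = raw  # assumption: leave other types unconverted
        items[name] = values
        pos += 3 + count
    return items


sample = """TIME 1 F 0.0
DISTANCE 10 F 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0"""
print(parse_items(sample))
```

The count in each header is what tells the loop how far to advance, which is the same bookkeeping the binary reader needs, only in bytes rather than tokens.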
-- Message: 4 Date: Mon, 02 Feb 2009 14:53:59 -0700 From: Bernd Prager be...@prager.ws Subject: [Tutor] question about mpmath product expression To: tutor@python.org Message-ID: ac7e7f56dc4bc0903dc7df8861f9b...@prager.ws Content-Type: text/plain; charset=UTF-8 Does anybody know if there is a precision difference when I use mpmath and take an expression: from mpmath import * mp.dps = 100 mu0 = [mpf('4') * pi * power(10, -7) rather then: mu0 = fprod([mpf('4'), pi, power(10, -7)]) ? Thanks, -- Bernd -- Message: 5 Date: Mon, 2 Feb 2009 14:46:18 -0800 (PST) From: Bernard Rankin beranki...@yahoo.com Subject: [Tutor] regex: not start with FOO To: Tutor@python.org Message-ID: 528538.84097...@web112218.mail.gq1.yahoo.com Content-Type: text/plain; charset=us-ascii Hello, I'd like to match any line that does not start with FOO. (Using just a reg-ex rule) 1) What is the effective difference between: (?!^FOO).* ^(?!FOO).* 2) Is there a better way to do this? Thanks, :) -- Message: 6 Date: Mon, 02 Feb 2009 15:50:18 -0800 From: WM. wfergus...@socal.rr.com Subject: [Tutor] newton's sqrt formula To: tutor@python.org Message-ID: 498786ba.6090...@socal.rr.com Content-Type: text/plain; charset=ISO-8859-1; format=flowed # program to find square root square = input ('Please enter a number to be rooted, ') square = square * 1.0 guess = input('Please guess at the root, ') guess = guess * 1.0 newguess = 0. while guess**2 != square: # Newton's formula newguess = guess - (guess * guess - square) / (guess * 2) guess = newguess guess**2 - square print print print guess, ' is the square root of ', square print print print 'bye' Last month there was a square root program discussed. I wondered if the tide of my ignorance had receded enough that I could take a whack at messing with it. I offer this rewrite for your critique. Can it be terser, faster, prettier? Thank you. 
-- Message: 7 Date: Tue, 3 Feb 2009 00:44:27 - From: Alan Gauld alan.ga...@btinternet.com Subject: Re: [Tutor] newton's sqrt formula To: tutor@python.org Message-ID: gm841b$l9...@ger.gmane.org Content-Type: text/plain; format=flowed; charset=iso-8859-1; reply-type=response WM. wfergus...@socal.rr.com wrote square = input ('Please enter a number to be rooted, ') square = square * 1.0 Use raw_input() instead of input() and don't multiply by 1.0 - instead convert to float using float(): square = float( raw_input ('Please enter a number to be rooted, ')) guess = input('Please guess at the root, ') guess = guess * 1.0 newguess = 0. while guess**2 != square: #
Re: [Tutor] reading binary files
Bob At 19:52 03/02/2009, you wrote: etrade.griffi...@dsl.pipex.com wrote: Data format: TIME 1 F 0.0 DISTANCE 10 F 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 F=float, D=double, L=logical, S=string etc The first part of the file should contain a string (eg TIME), an integer (1) and another string (eg F) so I tried using import struct in_file = open(file_name+.dat,rb) data = in_file.read() items = struct.unpack('sds', data) Now I get the error error: unpack requires a string argument of length 17 which has left me completely baffled! Did you open the file with mode 'b'? If not change that. You are passing the entire file to unpack when you should be giving it only the first line. That's why is is complaining about the length. We need to figure out the lengths of the lines. Consider the first line TIME 1 F 0.0 There were (I assume) 4 FORTRAN variables written here: character integer character float. Without knowing the lengths of the character variables we are at a loss as to what the struct format should be. Do you know their lengths? Is the last float or double? Try this: print data[:40] You should see something like: TIME...\x01\x00\x00\x00...F...\x00\x00\x00\x00...DISTANCE...\n\x00\x00\x00 where ... means 0 or more intervening stuff. It might be that the \x01 and the \n are in other places, as we also have to deal with byte order issues. Please do this and report back your results. And also the FORTRAN variable types if you have access to them. Apologies if this is getting a bit messy but the files are at a remote location and I forgot to bring copies home. I don't have access to the original FORTRAN program so I tried to emulate the reading the data using the Python script below. AFAIK the FORTRAN format line for the header is (1X, 1X, A8, 1X, 1X, I6, 1X, 1X, A1). If the data following is a float it is written using n(1X, F6.2) where n is the number of records picked up from the preceding header. 
# test program to read binary data

import struct

# create dummy data
data = []
for i in range(0,10):
    data.append(float(i))

# write data to binary file
b_file = open("test.bin", "wb")
b_file.write("  %8s  %6d  %1s\n" % ("DISTANCE", len(data), "F"))
for x in data:
    b_file.write(" %6.2f" % x)
b_file.close()

# read back data from file
c_file = open("test.bin", "rb")
data = c_file.read()
start, stop = 0, struct.calcsize("2s8s2si2s1s")
items = struct.unpack("2s8s2si2s1s", data[start:stop])
print items
print data[:40]

I'm pretty sure that when I tried this at the other PC there were a bunch of \x00\x00 characters in the file but they don't appear in NotePad ... anyway, I thought the Python above would unpack the data but items appears as

('  ', 'DISTANCE', '  ', 538976288, '10', ' ')

which seems to contain an extra item (538976288)

Alun Griffiths
Re: [Tutor] reading binary files
etrade.griffi...@dsl.pipex.com wrote:

> Data format:
>
> TIME 1 F 0.0
> DISTANCE 10 F 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
>
> F=float, D=double, L=logical, S=string etc
>
> The first part of the file should contain a string (eg TIME), an integer (1) and another string (eg F) so I tried using
>
> import struct
> in_file = open(file_name+".dat", "rb")
> data = in_file.read()
> items = struct.unpack('sds', data)
>
> Now I get the error
>
> error: unpack requires a string argument of length 17
>
> which has left me completely baffled!

Did you open the file with mode 'b'? If not change that.

You are passing the entire file to unpack when you should be giving it only the first line. That's why it is complaining about the length. We need to figure out the lengths of the lines.

Consider the first line

TIME 1 F 0.0

There were (I assume) 4 FORTRAN variables written here: character integer character float. Without knowing the lengths of the character variables we are at a loss as to what the struct format should be. Do you know their lengths? Is the last float or double?

Try this: print data[:40]

You should see something like:

TIME...\x01\x00\x00\x00...F...\x00\x00\x00\x00...DISTANCE...\n\x00\x00\x00

where ... means 0 or more intervening stuff. It might be that the \x01 and the \n are in other places, as we also have to deal with byte order issues.

Please do this and report back your results. And also the FORTRAN variable types if you have access to them.

-- 
Bob Gailer
Chapel Hill NC
919-636-4239
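Where the "length 17" in the error comes from can be seen with struct.calcsize. In the format 'sds' each s is a single byte and d is an 8-byte double; in native mode struct also inserts alignment padding, which on typical platforms gives 1 + 7 + 8 + 1 = 17. The Python 3 sketch below also shows unpacking only a slice of the data; the layout "=4si1s" is purely hypothetical, since the real field lengths are still unknown at this point in the thread.

```python
import struct

native = struct.calcsize("sds")    # platform dependent; 17 where doubles are 8-byte aligned
packed = struct.calcsize("=sds")   # standard sizes, no padding: 1 + 8 + 1
print(native, packed)

# Hypothetical layout: 4-char name, 4-byte int, 1-char type code
fmt = "=4si1s"
data = struct.pack(fmt, b"TIME", 1, b"F") + b"...rest of file..."

size = struct.calcsize(fmt)
print(struct.unpack(fmt, data[:size]))   # pass only the bytes the format describes
```

Note that a bare s is a 1-byte string, so a name such as TIME needs an explicit count like 4s or 8s.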
Re: [Tutor] reading binary files
First question: are you trying to work with the file written UNFORMATTED? If so, read on.

If you are working with a file formatted (1X, 1X, A8, 1X, 1X, I6, 1X, 1X, A1), then we have a completely different issue to deal with. Do not read on; instead, let us know.

eShopping wrote:
>>> Data format:
>>>
>>> TIME 1 F 0.0
>>> DISTANCE 10 F 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
>>>
>>> F=float, D=double, L=logical, S=string etc
>>>
>>> The first part of the file should contain a string (eg TIME), an integer (1) and another string (eg F) so I tried using
>>>
>>>     import struct
>>>     in_file = open(file_name+".dat","rb")
>>>     data = in_file.read()
>>>     items = struct.unpack('sds', data)
>>>
>>> Now I get the error
>>>
>>>     error: unpack requires a string argument of length 17
>>>
>>> which has left me completely baffled!
>>
>> Did you open the file with mode 'b'? If not, change that.
>>
>> You are passing the entire file to unpack when you should be giving it only the first line. That's why it is complaining about the length. We need to figure out the lengths of the lines.
>>
>> Consider the first line
>>
>> TIME 1 F 0.0
>>
>> There were (I assume) 4 FORTRAN variables written here: character, integer, character, float. Without knowing the lengths of the character variables we are at a loss as to what the struct format should be. Do you know their lengths? Is the last one float or double?
>>
>> Try this:
>>
>>     print data[:40]
>>
>> You should see something like:
>>
>>     TIME...\x01\x00\x00\x00...F...\x00\x00\x00\x00...DISTANCE...\n\x00\x00\x00
>>
>> where ... means 0 or more intervening bytes. It might be that the \x01 and the \n are in other places, as we also have to deal with byte order issues.
>>
>> Please do this and report back your results. And also the FORTRAN variable types if you have access to them.
>
> Apologies if this is getting a bit messy but the files are at a remote location and I forgot to bring copies home. I don't have access to the original FORTRAN program so I tried to emulate reading the data using the Python script below. AFAIK the FORTRAN format line for the header is (1X, 1X, A8, 1X, 1X, I6, 1X, 1X, A1). If the data following is a float it is written using n(1X, F6.2) where n is the number of records picked up from the preceding header.
>
>     # test program to read binary data
>     import struct
>
>     # create dummy data
>     data = []
>     for i in range(0,10):
>         data.append(float(i))
>
>     # write data to binary file
>     b_file = open("test.bin","wb")
>     b_file.write("  %8s  %6d  %1s\n" % ("DISTANCE", len(data), "F"))
>     for x in data:
>         b_file.write(" %6.2f" % x)

You are still confusing text vs binary. The above writes text regardless of the file mode. If the FORTRAN file was written UNFORMATTED then you are NOT emulating that with the above program.

The character data is read back in just fine, since there is no translation involved in the writing or in the reading. The integer len(data) is being written as its text (character) representation (translating binary to text) but is being read back in without translation. Also, all the floating point data is going out as text.

The file looks like this (where b = blank), as it would appear in Notepad:

    bbDISTANCEbbbbbb10bbF
    bbb0.00bbb1.00bbb2.00...

If you analyze this with 2s8s2si2s1s you will see that 2s matches bb, 8s matches DISTANCE, 2s matches bb, and i matches the next four blanks (\x20\x20\x20\x20). The i tells unpack to shove those 4 bytes unaltered into a Python integer, resulting in 538976288. You can verify that:

    struct.unpack('i', '    ')
    (538976288,)

Please either assure me you understand, or be prepared for a more in-depth tutorial.

>     b_file.close()
>
>     # read back data from file
>     c_file = open("test.bin","rb")
>     data = c_file.read()
>     start, stop = 0, struct.calcsize("2s8s2si2s1s")
>     items = struct.unpack("2s8s2si2s1s",data[start:stop])
>     print items
>     print data[:40]
>
> I'm pretty sure that when I tried this at the other PC there were a bunch of \x00\x00 characters in the file but they don't appear in NotePad ... anyway, I thought the Python above would unpack the data but items appears as
>
>     ('  ', 'DISTANCE', '  ', 538976288, '10', ' ')
>
> which seems to contain an extra item (538976288)
>
> Alun Griffiths

--
Bob Gailer
Chapel Hill NC
919-636-4239
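The text-versus-binary point in the message above can be checked directly. The following is a minimal sketch in Python 3 (the thread itself uses Python 2). It rebuilds the header line the test script writes, assuming two blanks before each field as the FORTRAN format (1X, 1X, A8, 1X, 1X, I6, 1X, 1X, A1) implies, and unpacks it with the same struct format. The `<` prefix (little-endian, no padding) is an addition that does not appear in the thread; it is used only so the result does not depend on the machine.

```python
import struct

# Header line as the test script writes it: plain text, not binary.
# Assumption: two blanks precede each field, per (1X, 1X, A8, 1X, 1X, I6, 1X, 1X, A1).
text_header = ("  %8s  %6d  %1s\n" % ("DISTANCE", 10, "F")).encode("ascii")

# '<' = little-endian with no alignment padding (an assumption for reproducibility).
fmt = "<2s8s2si2s1s"
size = struct.calcsize(fmt)
items = struct.unpack(fmt, text_header[:size])
print(items)

# The "extra" integer is just four ASCII blanks (0x20) read as one 32-bit int.
four_blanks = struct.unpack("<i", b"    ")[0]
print(four_blanks, hex(four_blanks))

# A genuinely binary header would be produced with struct.pack instead.
binary_header = struct.pack("<8si1s", b"DISTANCE", 10, b"F")
name, count, kind = struct.unpack("<8si1s", binary_header)
print(name, count, kind)
```

The first unpack reproduces the tuple reported in the thread, including 538976288; the last one shows that the same struct format family reads the count back correctly once the integer has actually been written as 4 binary bytes.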
Re: [Tutor] reading binary files
Bob

I am trying to read UNFORMATTED files. The files also occur as formatted files, and the format string I provided is the string used to write the formatted version. I can read the formatted version OK. I (naively) assumed that the same format string was used for both files, the only difference being whether the FORTRAN WRITE statement indicated unformatted or formatted.

Best regards

Alun Griffiths

At 21:41 03/02/2009, bob gailer wrote:
> First question: are you trying to work with the file written UNFORMATTED? If so read on. If you are working with a file formatted (1X, 1X, A8, 1X, 1X, I6, 1X, 1X, A1) then we have a completely different issue to deal with. Do not read on, instead let us know.
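A format string such as (1X, 1X, A8, ...) only governs formatted output; an UNFORMATTED write stores the raw bytes of each variable. The sketch below (Python 3) is a hedged illustration, not a description of the files in this thread. It assumes the layout many FORTRAN compilers use for unformatted sequential files, where each record is framed by a 4-byte length marker before and after the payload. The marker size, the little-endian byte order, the record contents, and the helper names `iter_records` and `frame` are all assumptions made for the example.

```python
import struct

def iter_records(data, marker="<i"):
    """Yield the payload of each record framed by leading/trailing length markers."""
    msize = struct.calcsize(marker)
    pos = 0
    while pos < len(data):
        (length,) = struct.unpack_from(marker, data, pos)
        start = pos + msize
        end = start + length
        (trailer,) = struct.unpack_from(marker, data, end)
        if trailer != length:
            raise ValueError("record markers disagree at offset %d" % pos)
        yield data[start:end]
        pos = end + msize

def frame(payload, marker="<i"):
    """Wrap a payload in the assumed record layout (used only to build the demo)."""
    m = struct.pack(marker, len(payload))
    return m + payload + m

# Made-up sample: a header record (name, count, type) and a record of 10 floats.
header = struct.pack("<8si4s", b"DISTANCE", 10, b"REAL")
values = struct.pack("<10f", *[float(i) for i in range(10)])
sample = frame(header) + frame(values)

records = list(iter_records(sample))
name, count, kind = struct.unpack("<8si4s", records[0])
floats = struct.unpack("<%df" % count, records[1])
print(name, count, kind, floats)
```

If the real files use a different marker size or byte order, the first few bytes will not decode to a plausible record length, which is a quick way to test the assumption.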
Re: [Tutor] reading binary files
etrade.griffi...@dsl.pipex.com wrote

> I am trying to read data from a file that has format
>
> item_name num_items item_type items
>
> eg
>
> TIME 1 0.0
> DISTANCE 10 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0

Where is the item_type?

> I can read this if the data are in ASCII format using
>
>     in_file = open("my_file.dat","r")
>     data1 = in_file.read()
>     tokens = data1.split()

It might be easier to process line by line using readline or readlines rather than read, but otherwise OK so far...

> and then stepping through the resulting list but the data also appear in the same format in a binary file.

When you say a binary file, do you mean an ASCII file encoded into binary using some standard algorithm? Or do you mean the data is binary so that, for example, the number 1 would appear as 4 bytes?

If so, do you know how strings (the name) are delimited? Also, how many characters could be present: is the length a single byte or multiple bytes? And are the records fixed length or variable? If variable, what is the field/record separator?

You may need to load the file into a hex editor or debugger to determine the answers...

Having done that, the struct module will allow you to read the data. You can see a basic example of using struct in my tutorial topic about handling files.

HTH,

--
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld
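If no hex editor is to hand, the inspection step suggested above can be approximated in Python itself. This is a small sketch in Python 3; `hexdump` is a hypothetical helper written for this example, and the sample bytes are made up to stand in for the start of a binary file (an 8-character name, a 4-byte integer, and an 8-byte double, little-endian by assumption).

```python
import struct

def hexdump(data, width=16):
    """Return a hex + printable-ASCII dump of data, one row per `width` bytes."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hexpart = " ".join("%02x" % b for b in chunk)
        # Non-printable bytes are shown as '.' in the text column.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append("%08x  %-*s  %s" % (offset, width * 3 - 1, hexpart, text))
    return "\n".join(rows)

# Made-up bytes standing in for the start of a binary file.
sample = struct.pack("<8sid", b"TIME    ", 1, 0.0)
print(hexdump(sample))
```

In a dump like this, names show up as readable text while integers and floats appear as runs of mostly unprintable bytes, which helps answer the questions about field widths and delimiters before choosing a struct format. For a real file, the same function could be applied to the first few dozen bytes read in 'rb' mode.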
Re: [Tutor] reading binary files
On Mon, 2009-02-02 at 11:31 +, etrade.griffi...@dsl.pipex.com wrote:
> Hi
>
> I am trying to read data from a file that has format
>
> item_name num_items item_type items
>
> eg
>
> TIME 1 0.0
> DISTANCE 10 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
> TIME 1 1.0
> DISTANCE 10 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
>
> I can read this if the data are in ASCII format using
>
>     in_file = open("my_file.dat","r")
>     data1 = in_file.read()
>     tokens = data1.split()
>
> and then stepping through the resulting list but the data also appear in the same format in a binary file. I tried converting the binary file to an ASCII file using
>
>     ifile = open("my_file.dat","rb")
>     ofile = open("new_file.dat","w")
>     base64.decode(ifile, ofile)
>
> but that gave the error "Error: Incorrect padding". I imagine that there is a straightforward way of doing this but haven't found it so far. Would be grateful for any suggestions!
>
> Thanks
>
> Alun Griffiths

Honestly, I'm not sure what you're asking for, but in general for reading binary data I use the struct module. Check it out in the documentation.

John Purser
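One likely reason the base64 attempt failed: base64 decoding expects input that is already base64 text, whereas a raw binary file is just bytes, so the decoder rejects it. The sketch below (Python 3) illustrates the difference using made-up data; the field layout and the little-endian `<` prefix are assumptions for the example, not the layout of the files discussed in the thread.

```python
import base64
import binascii
import struct

# Made-up raw binary data: one 32-bit integer followed by one 64-bit float.
raw = struct.pack("<id", 1, 0.0)

# base64 decoding expects base64 *text*; raw binary is not that, so it is rejected.
try:
    base64.b64decode(raw, validate=True)
    decoded_ok = True
except binascii.Error:
    decoded_ok = False
print("base64 decode succeeded:", decoded_ok)

# struct interprets the bytes directly according to the given format.
count, value = struct.unpack("<id", raw)
print(count, value)
```

The struct call recovers the original values because its format string states how many bytes each field occupies and how to interpret them, which is the information a binary file does not carry on its own.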