Re: Reading by positions plain text files

2010-12-13 Thread javivd
On Dec 12, 11:21 pm, Dennis Lee Bieber wlfr...@ix.netcom.com wrote:
 On Sun, 12 Dec 2010 07:02:13 -0800 (PST), javivd
 javiervan...@gmail.com declaimed the following in
 gmane.comp.python.general:



  f = open(r'c:c:\somefile.txt', 'w')

  f.write('0123456789\n0123456789\n0123456789')

         Not the most explanatory sample data... It would be better if the
 records had different contents.

  f.close()

  f = open(r'c:\somefile.txt', 'r')

  for line in f:

         Here you extract one line from the file

      f.seek(3,0)
      print f.read(1) #just to know if it's printing the right column

         And here you ignored the entire line you read, seeking to the fourth
 byte from the beginning of the file, and reading just one byte from it.

         I have no idea of how seek()/read() behaves relative to line
 iteration in the for loop... Given the small size of the test data set
 it is quite likely that the first for line in f resulted in the entire
 file being read into a buffer, and that buffer scanned to find the line
 ending and return the data preceding it; then the buffer position is set
 to after that line ending so the next for line continues from that
 point.

         But in a situation with a large data set, or an unbuffered I/O
 system, the seek()/read() could easily result in resetting the file
 position used by the for line, so that the second call returns
 456789\n... And all subsequent calls too, resulting in an infinite
 loop.

         Presuming the assignment requires pulling multiple selected fields
 from individual records, where each record is of the same
 format/spacing, AND that the field selection can not be preprogrammed...

 Sample data file (use fixed width font to view):
 -=-=-=-=-=-
 Wulfraed       09Ranger  1915
 Bask Euren     13Cleric  1511
 Aethelwulf     07Mage    0908
 Cwiculf        08Mage    1008
 -=-=-=-=-=-

 Sample format definition file:
 -=-=-=-=-=-
 Name    0-14
 Level   15-16
 Class   17-24
 THAC0   25-26
 Armor   27-28
 -=-=-=-=-=-

 Code to process (Python 2.5, with minimal error handling):
 -=-=-=-=-=-

 class Extractor(object):
     def __init__(self, formatFile):
         ff = open(formatFile, "r")
         self._format = {}
         self._length = 0
         for line in ff:
             form = line.split("\t") #file must be tab separated
             if len(form) != 2:
                 print "Invalid file format definition: %s" % line
                 continue
             name = form[0]
             columns = form[1].split("-")
             if len(columns) == 1:   #single column definition
                 start = int(columns[0])
                 end = start
             elif len(columns) == 2:
                 start = int(columns[0])
                 end = int(columns[1])
             else:
                 print "Invalid column definition: %s" % form[1]
                 continue
             self._format[name] = (start, end)
             self._length = max(self._length, end)
         ff.close()

     def __call__(self, line):
         data = {}
         if len(line) < self._length:
             print "Data line is too short for required format: ignored"
         else:
             for (name, (start, end)) in self._format.items():
                 data[name] = line[start:end+1]
         return data

 if __name__ == "__main__":
     FORMATFILE = "SampleFormat.tsv"
     DATAFILE = "SampleData.txt"

     characterExtractor = Extractor(FORMATFILE)

     df = open(DATAFILE, "r")
     for line in df:
         fields = characterExtractor(line)
         for (name, value) in fields.items():
             print "Field name: '%s'\t\tvalue: '%s'" % (name, value)
         print

     df.close()
 -=-=-=-=-=-

 Output from running above code:
 -=-=-=-=-=-
 Field name: 'Armor'             value: '15'
 Field name: 'THAC0'             value: '19'
 Field name: 'Level'             value: '09'
 Field name: 'Class'             value: 'Ranger  '
 Field name: 'Name'              value: 'Wulfraed       '

 Field name: 'Armor'             value: '11'
 Field name: 'THAC0'             value: '15'
 Field name: 'Level'             value: '13'
 Field name: 'Class'             value: 'Cleric  '
 Field name: 'Name'              value: 'Bask Euren     '

 Field name: 'Armor'             value: '08'
 Field name: 'THAC0'             value: '09'
 Field name: 'Level'             value: '07'
 Field name: 'Class'             value: 'Mage    '
 Field name: 'Name'              value: 'Aethelwulf     '

 Field name: 'Armor'             value: '08'
 Field name: 'THAC0'             value: '10'
 Field name: 'Level'             value: '08'
 Field name: 'Class'             value: 'Mage    '
 Field name: 'Name'              value: 'Cwiculf        '
 -=-=-=-=-=-

         Note that string fields have not been trimmed, also numeric fields
 are still in text format... The format definition file would need to be
 expanded to include a string, integer, float (and Boolean?) code
 in order for the extractor to do proper type 

Re: Reading by positions plain text files

2010-12-12 Thread javivd
On Dec 1, 7:15 am, Tim Harig user...@ilthio.net wrote:
 On 2010-12-01, javivd javiervan...@gmail.com wrote:









  On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:
  On 2010-11-30, javivd javiervan...@gmail.com wrote:

   I have a case now in which another file has been provided (besides the
   database) that tells me in which column of the file every variable is,
   because there isn't any blank or tab character that separates the
   variables; they are stuck together. This second file specifies the
   variable name and its position:

   VARIABLE NAME      POSITION (COLUMN) IN FILE
   var_name_1                 123-123
   var_name_2                 124-125
   var_name_3                 126-126
   ..
   ..
   var_name_N                 512-513 (last positions)

  I am unclear on the format of these positions.  They do not look like
  what I would expect from absolute references in the data.  For instance,
  123-123 may only contain one byte??? which could change for different
  encodings and how you mark line endings.  Frankly, the use of the
  word columns in the header suggests that the data *is* separated by
  line endings rather than absolute position and the position refers to
  the line number. In which case, you can use splitlines() to break up
  the data and then address the proper line by index.  Nevertheless,
  you can use file.seek() to move to an absolute offset in the file,
  if that really is what you are looking for.

  I work in a survey research firm. the data im talking about has a lot
  of 0-1 variables, meaning yes or no of a lot of questions. so only one
  position of a character is needed (not byte), explaining the 123-123
  kind of positions of a lot of variables.

 Then file.seek() is what you are looking for; but, you need to be aware of
 line endings and encodings as indicated.  Make sure that you open the file
 using whatever encoding was used when it was generated or you could have
 problems with multibyte characters affecting the offsets.

I've tried your advice and something is wrong. Here is my code,



f = open(r'c:c:\somefile.txt', 'w')

f.write('0123456789\n0123456789\n0123456789')

f.close()

f = open(r'c:\somefile.txt', 'r')


for line in f:
    f.seek(3,0)
    print f.read(1) #just to know if it's printing the right column

I used .seek() in this manner, but it is not working.

Let me put the problem another way. I have a .txt file with NO
headers and NO blanks between any columns. But I know that columns,
say, 13 to 15 hold the variable VARNAME_1 (of course, a three-digit
var). How can I extract that column into a list called VARNAME_1??

Obviously, this should extend to all the positions and variables I
have to extract from the file.
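
One way to read the question above, as a minimal sketch and not a definitive answer: slice every line at the same columns and collect the pieces in a list. It is written for Python 3 (the snippets in this thread are Python 2), and it assumes the column numbers are 1-based and inclusive. The helper name extract_column and the in-memory sample data are made up for illustration.

```python
import io

def extract_column(lines, first_col, last_col):
    """Collect columns first_col..last_col (1-based, inclusive) from each line."""
    values = []
    for line in lines:
        record = line.rstrip("\r\n")  # drop the line ending before slicing
        values.append(record[first_col - 1:last_col])
    return values

# Stand-in for open('somefile.txt'): three records with no separators.
sample = io.StringIO(
    "A" * 12 + "123ZZ\n" +
    "B" * 12 + "456ZZ\n" +
    "C" * 12 + "789ZZ\n"
)
VARNAME_1 = extract_column(sample, 13, 15)
print(VARNAME_1)  # ['123', '456', '789']
```

If the positions turn out to be 0-based, the slice would be record[first_col:last_col + 1] instead.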

Thanks!

J
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Reading by positions plain text files

2010-12-12 Thread Tim Harig
On 2010-12-12, javivd javiervan...@gmail.com wrote:
 On Dec 1, 7:15 am, Tim Harig user...@ilthio.net wrote:
 On 2010-12-01, javivd javiervan...@gmail.com wrote:
  On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:
  encodings and how you mark line endings.  Frankly, the use of the
  word columns in the header suggests that the data *is* separated by
  line endings rather than absolute position and the position refers to
  the line number. In which case, you can use splitlines() to break up
  the data and then address the proper line by index.  Nevertheless,

^^
Note that I specifically questioned the use of absolute file position vs.
position within a column.  These are two different things.  You use
different methods to extract each.

  I work in a survey research firm. the data im talking about has a lot
  of 0-1 variables, meaning yes or no of a lot of questions. so only one
  position of a character is needed (not byte), explaining the 123-123
  kind of positions of a lot of variables.

 Then file.seek() is what you are looking for; but, you need to be aware of
 line endings and encodings as indicated.  Make sure that you open the file
 using whatever encoding was used when it was generated or you could have
 problems with multibyte characters affecting the offsets.

 f = open(r'c:c:\somefile.txt', 'w')

I suspect you don't need to use the c: twice.

 f.write('0123456789\n0123456789\n0123456789')

Note that the file you are writing contains three lines.  Is the data that
you are looking for located at an absolute position in the file or at a
position within an individual line?  If the latter, note that line endings
may be composed of more than a single character.
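
To make the line-ending point concrete, here is a small Python 3 sketch (the thread itself uses Python 2). It assumes fixed-width records of 10 characters, as in the test file above; the helper absolute_offset is hypothetical and only illustrates the arithmetic.

```python
def absolute_offset(line_index, column, record_width, newline_width):
    """Byte offset of a 0-based column on a 0-based line of fixed-width records."""
    return line_index * (record_width + newline_width) + column

unix_data = b"0123456789\n0123456789\n0123456789"
dos_data = b"0123456789\r\n0123456789\r\n0123456789"

# Column 3 of the second line (index 1), with 10-character records.
unix_offset = absolute_offset(1, 3, 10, 1)   # 14
dos_offset = absolute_offset(1, 3, 10, 2)    # 15

print(unix_data[unix_offset:unix_offset + 1])  # b'3'
print(dos_data[dos_offset:dos_offset + 1])     # b'3'
# Reusing the "\n" offset on the "\r\n" data lands one byte early.
print(dos_data[unix_offset:unix_offset + 1])   # b'2'
```

Reading line by line and slicing each line avoids this arithmetic entirely, since the line ending never enters the column count.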

 f.write('0123456789\n0123456789\n0123456789')
             ^ position 3 using fseek()

 for line in f:

Perhaps you meant:
for character in f.read():
or
for line in f.read().splitlines()

 f.seek(3,0)

This will always take you back to the exact fourth position in the file
(indicated above).

 I used .seek() in this manner, but is not working.

It is working the way it is supposed to.

If you want the absolute position 3 in a file then:

f = open('somefile.txt', 'r')
f.seek(3)
variable = f.read(1)

If you want the absolute position in a column:
f = open('somefile.txt', 'r').read().splitlines()
for column in f:
    variable = column[3]


Re: Reading by positions plain text files

2010-12-12 Thread Tim Harig
On 2010-12-12, Tim Harig user...@ilthio.net wrote:
 I used .seek() in this manner, but is not working.

 It is working the way it is supposed to.
 If you want the absolute position in a column:

   f = open('somefile.txt', 'r').read().splitlines()
   for column in f:
       variable = column[3]

or:
f = open('somefile.txt', 'r')
for column in f.readlines():
    variable = column[3]


Re: Reading by positions plain text files

2010-12-03 Thread javivd
On Dec 1, 3:15 am, Tim Harig user...@ilthio.net wrote:
 On 2010-12-01, javivd javiervan...@gmail.com wrote:



  On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:
  On 2010-11-30, javivd javiervan...@gmail.com wrote:

   I have a case now in which another file has been provided (besides the
   database) that tells me in which column of the file every variable is,
   because there isn't any blank or tab character that separates the
   variables; they are stuck together. This second file specifies the
   variable name and its position:

   VARIABLE NAME      POSITION (COLUMN) IN FILE
   var_name_1                 123-123
   var_name_2                 124-125
   var_name_3                 126-126
   ..
   ..
   var_name_N                 512-513 (last positions)

  I am unclear on the format of these positions.  They do not look like
  what I would expect from absolute references in the data.  For instance,
  123-123 may only contain one byte??? which could change for different
  encodings and how you mark line endings.  Frankly, the use of the
  word columns in the header suggests that the data *is* separated by
  line endings rather than absolute position and the position refers to
  the line number. In which case, you can use splitlines() to break up
  the data and then address the proper line by index.  Nevertheless,
  you can use file.seek() to move to an absolute offset in the file,
  if that really is what you are looking for.

  I work in a survey research firm. the data im talking about has a lot
  of 0-1 variables, meaning yes or no of a lot of questions. so only one
  position of a character is needed (not byte), explaining the 123-123
  kind of positions of a lot of variables.

 Then file.seek() is what you are looking for; but, you need to be aware of
 line endings and encodings as indicated.  Make sure that you open the file
 using whatever encoding was used when it was generated or you could have
 problems with multibyte characters affecting the offsets.

Ok, I will try it and let you know. Thanks all!!


RE: Reading by positions plain text files

2010-12-01 Thread Javier Van Dam

Ok. I will try it and let you know. Thanks a lot!!

J

 Date: Tue, 30 Nov 2010 20:32:56 -0600
 From: python.l...@tim.thechases.com
 To: javiervan...@gmail.com
 CC: python-list@python.org
 Subject: Re: Reading by positions plain text files
 
 On 11/30/2010 08:03 PM, javivd wrote:
  On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:
  VARIABLE NAME  POSITION (COLUMN) IN FILE
  var_name_1 123-123
  var_name_2 124-125
  var_name_3 126-126
  ..
  ..
  var_name_N 512-513 (last positions)
 
  and no, MRAB, it's not the similar problem (at least what i understood
  of it). I have to associate the position this file give me with the
  variable name this file give me for those positions.
 
 MRAB may be referring to my reply in that thread where you can do 
 something like
 
OFFSETS = 'offsets.txt'
offsets = {}
f = file(OFFSETS)
f.next() # throw away the headers
for row in f:
  varname, rest = row.split()[:2]
  # sanity check
  if varname in offsets:
    print "[%s] in %s twice?!" % (varname, OFFSETS)
  if '-' not in rest: continue
  start, stop = map(int, rest.split('-'))
  offsets[varname] = slice(start, stop+1) # 0-based offsets
  #offsets[varname] = slice(start-1, stop) # 1-based offsets
f.close()

def do_something_with(data):
  # your real code goes here
  print data['var_name_2']

for row in file('data.txt'):
  data = dict((name, row[offsets[name]]) for name in offsets)
  do_something_with(data)
 
 There's additional robustness-checks I'd include if your 
 offsets-file isn't controlled by you (people send me daft data).
 
 -tkc
 
 
 
 


Reading by positions plain text files

2010-11-30 Thread javivd
Hi all,

Sorry, newbie question:

I have a database in a plain text file (could be .txt or .dat, it's the
same) that I need to read in Python in order to do some data
validation. In other cases I read this kind of file with the split()
method, reading line by line. But split() relies on a separator
character (I think... all I know is that it works OK).

I have a case now in which another file has been provided (besides the
database) that tells me in which column of the file every variable is,
because there isn't any blank or tab character that separates the
variables; they are stuck together. This second file specifies the
variable name and its position:


VARIABLE NAME   POSITION (COLUMN) IN FILE
var_name_1  123-123
var_name_2  124-125
var_name_3  126-126
..
..
var_name_N  512-513 (last positions)

How can I read this so that each position in the file is associated with
its variable name?

Thanks a lot!!

Javier



Re: Reading by positions plain text files

2010-11-30 Thread Tim Harig
On 2010-11-30, javivd javiervan...@gmail.com wrote:
 I have a case now in which another file has been provided (besides the
 database) that tells me in which column of the file every variable is,
 because there isn't any blank or tab character that separates the
 variables; they are stuck together. This second file specifies the
 variable name and its position:

 VARIABLE NAME      POSITION (COLUMN) IN FILE
 var_name_1                 123-123
 var_name_2                 124-125
 var_name_3                 126-126
 ..
 ..
 var_name_N                 512-513 (last positions)

I am unclear on the format of these positions.  They do not look like
what I would expect from absolute references in the data.  For instance,
123-123 may only contain one byte??? which could change for different
encodings and how you mark line endings.  Frankly, the use of the
word columns in the header suggests that the data *is* separated by
line endings rather than absolute position and the position refers to
the line number. In which case, you can use splitlines() to break up
the data and then address the proper line by index.  Nevertheless,
you can use file.seek() to move to an absolute offset in the file,
if that really is what you are looking for.


Re: Reading by positions plain text files

2010-11-30 Thread MRAB

On 30/11/2010 21:31, javivd wrote:

Hi all,

Sorry, newbie question:

I have a database in a plain text file (could be .txt or .dat, it's the
same) that I need to read in Python in order to do some data
validation. In other cases I read this kind of file with the split()
method, reading line by line. But split() relies on a separator
character (I think... all I know is that it works OK).

I have a case now in which another file has been provided (besides the
database) that tells me in which column of the file every variable is,
because there isn't any blank or tab character that separates the
variables; they are stuck together. This second file specifies the
variable name and its position:


VARIABLE NAME   POSITION (COLUMN) IN FILE
var_name_1  123-123
var_name_2  124-125
var_name_3  126-126
..
..
var_name_N  512-513 (last positions)

How can I read this so that each position in the file is associated with
its variable name?


It sounds like a similar problem to this:

http://groups.google.com/group/comp.lang.python/browse_thread/thread/53e6f41bfff6/123422d510187dc3?show_docid=123422d510187dc3


Re: Reading by positions plain text files

2010-11-30 Thread javivd
On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:
 On 2010-11-30, javivd javiervan...@gmail.com wrote:

  I have a case now in which another file has been provided (besides the
  database) that tells me in which column of the file every variable is,
  because there isn't any blank or tab character that separates the
  variables; they are stuck together. This second file specifies the
  variable name and its position:

  VARIABLE NAME      POSITION (COLUMN) IN FILE
  var_name_1                 123-123
  var_name_2                 124-125
  var_name_3                 126-126
  ..
  ..
  var_name_N                 512-513 (last positions)

 I am unclear on the format of these positions.  They do not look like
 what I would expect from absolute references in the data.  For instance,
 123-123 may only contain one byte??? which could change for different
 encodings and how you mark line endings.  Frankly, the use of the
 word columns in the header suggests that the data *is* separated by
 line endings rather than absolute position and the position refers to
 the line number. In which case, you can use splitlines() to break up
 the data and then address the proper line by index.  Nevertheless,
 you can use file.seek() to move to an absolute offset in the file,
 if that really is what you are looking for.

I work in a survey research firm. The data I'm talking about has a lot
of 0-1 variables, meaning yes or no to a lot of questions, so only one
character position is needed (not byte), which explains the 123-123
kind of positions for a lot of variables.

And no, MRAB, it's not a similar problem (at least from what I understood
of it). I have to associate the positions this file gives me with the
variable names this file gives me for those positions.

thank you both and sorry for my english!

J


Re: Reading by positions plain text files

2010-11-30 Thread MRAB

On 01/12/2010 02:03, javivd wrote:

On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:

On 2010-11-30, javivd javiervan...@gmail.com wrote:


I have a case now in which another file has been provided (besides the
database) that tells me in which column of the file every variable is,
because there isn't any blank or tab character that separates the
variables; they are stuck together. This second file specifies the
variable name and its position:



VARIABLE NAME  POSITION (COLUMN) IN FILE
var_name_1 123-123
var_name_2 124-125
var_name_3 126-126
..
..
var_name_N 512-513 (last positions)


I am unclear on the format of these positions.  They do not look like
what I would expect from absolute references in the data.  For instance,
123-123 may only contain one byte??? which could change for different
encodings and how you mark line endings.  Frankly, the use of the
word columns in the header suggests that the data *is* separated by
line endings rather than absolute position and the position refers to
the line number. In which case, you can use splitlines() to break up
the data and then address the proper line by index.  Nevertheless,
you can use file.seek() to move to an absolute offset in the file,
if that really is what you are looking for.


I work in a survey research firm. the data im talking about has a lot
of 0-1 variables, meaning yes or no of a lot of questions. so only one
position of a character is needed (not byte), explaining the 123-123
kind of positions of a lot of variables.

and no, MRAB, it's not the similar problem (at least what i understood
of it). I have to associate the position this file give me with the
variable name this file give me for those positions.

thank you both and sorry for my english!


You just have to parse the second file to build a list (or dict)
containing the name, start position and end position of each variable:

variables = [("var_name_1", 123, 123), ...]

and then work through that list, extracting the data between those
positions in the first file and putting the values in another list (or
dict).

You also need to check whether the positions are 1-based or 0-based
(Python uses 0-based).
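
The two steps described above might look like the following in Python 3 (a sketch only; the thread's own code is Python 2). The function names parse_layout and extract, the one_based flag, and the tiny sample layout are assumptions made for illustration, and the layout is read from an in-memory stand-in for the real second file.

```python
import io

def parse_layout(layout_lines, one_based=True):
    """Return (name, start, end) tuples usable as line[start:end]."""
    variables = []
    for line in layout_lines:
        parts = line.split()
        # Skip the header, blank lines and ".." filler rows.
        if len(parts) < 2 or "-" not in parts[1]:
            continue
        first, last = (int(p) for p in parts[1].split("-"))
        if one_based:
            variables.append((parts[0], first - 1, last))
        else:
            variables.append((parts[0], first, last + 1))
    return variables

def extract(line, variables):
    """Map each variable name to its slice of one data line."""
    return dict((name, line[start:end]) for name, start, end in variables)

layout = io.StringIO(
    "VARIABLE NAME   POSITION (COLUMN) IN FILE\n"
    "q1  1-1\n"
    "age 2-3\n"
    "q2  4-4\n"
)
variables = parse_layout(layout)
print(variables)  # [('q1', 0, 1), ('age', 1, 3), ('q2', 3, 4)]
print(extract("1370", variables))
```

Converting to 0-based, end-exclusive pairs once, at parse time, keeps the 1-based versus 0-based decision in a single place.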


Re: Reading by positions plain text files

2010-11-30 Thread Tim Chase

On 11/30/2010 08:03 PM, javivd wrote:

On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:

VARIABLE NAME  POSITION (COLUMN) IN FILE
var_name_1 123-123
var_name_2 124-125
var_name_3 126-126
..
..
var_name_N 512-513 (last positions)



and no, MRAB, it's not the similar problem (at least what i understood
of it). I have to associate the position this file give me with the
variable name this file give me for those positions.


MRAB may be referring to my reply in that thread where you can do 
something like


  OFFSETS = 'offsets.txt'
  offsets = {}
  f = file(OFFSETS)
  f.next() # throw away the headers
  for row in f:
    varname, rest = row.split()[:2]
    # sanity check
    if varname in offsets:
      print "[%s] in %s twice?!" % (varname, OFFSETS)
    if '-' not in rest: continue
    start, stop = map(int, rest.split('-'))
    offsets[varname] = slice(start, stop+1) # 0-based offsets
    #offsets[varname] = slice(start-1, stop) # 1-based offsets
  f.close()

  def do_something_with(data):
    # your real code goes here
    print data['var_name_2']

  for row in file('data.txt'):
    data = dict((name, row[offsets[name]]) for name in offsets)
    do_something_with(data)

There's additional robustness-checks I'd include if your 
offsets-file isn't controlled by you (people send me daft data).
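
The message does not say which robustness checks are meant, so the following Python 3 sketch is only a guess at plausible ones: duplicate names, reversed or out-of-range columns, and overlapping fields. The function check_layout and its arguments are hypothetical, and the column numbers are assumed to be 1-based and inclusive.

```python
def check_layout(variables, record_width):
    """Return a list of problem descriptions for (name, first, last) entries."""
    problems = []
    seen = set()
    for name, first, last in variables:
        if name in seen:
            problems.append("%s: defined more than once" % name)
        seen.add(name)
        if first < 1 or last < first:
            problems.append("%s: bad range %d-%d" % (name, first, last))
        elif last > record_width:
            problems.append("%s: range %d-%d runs past column %d"
                            % (name, first, last, record_width))
    # Compare neighbours after sorting by start column to spot overlaps.
    ordered = sorted(variables, key=lambda v: (v[1], v[2]))
    for (n1, f1, l1), (n2, f2, l2) in zip(ordered, ordered[1:]):
        if f2 <= l1:
            problems.append("%s and %s overlap" % (n1, n2))
    return problems

good = [("q1", 1, 1), ("age", 2, 3)]
bad = [("a", 1, 2), ("b", 2, 3), ("a", 5, 4)]
print(check_layout(good, 10))  # []
for problem in check_layout(bad, 3):
    print(problem)
```

Collecting every problem in a list, instead of stopping at the first, makes it easier to send one complete report back to whoever supplied the layout file.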


-tkc






Re: Reading by positions plain text files

2010-11-30 Thread Tim Harig
On 2010-12-01, javivd javiervan...@gmail.com wrote:
 On Nov 30, 11:43 pm, Tim Harig user...@ilthio.net wrote:
 On 2010-11-30, javivd javiervan...@gmail.com wrote:

  I have a case now in which another file has been provided (besides the
  database) that tells me in which column of the file every variable is,
  because there isn't any blank or tab character that separates the
  variables; they are stuck together. This second file specifies the
  variable name and its position:

  VARIABLE NAME      POSITION (COLUMN) IN FILE
  var_name_1                 123-123
  var_name_2                 124-125
  var_name_3                 126-126
  ..
  ..
  var_name_N                 512-513 (last positions)

 I am unclear on the format of these positions.  They do not look like
 what I would expect from absolute references in the data.  For instance,
 123-123 may only contain one byte??? which could change for different
 encodings and how you mark line endings.  Frankly, the use of the
 word columns in the header suggests that the data *is* separated by
 line endings rather than absolute position and the position refers to
 the line number. In which case, you can use splitlines() to break up
 the data and then address the proper line by index.  Nevertheless,
 you can use file.seek() to move to an absolute offset in the file,
 if that really is what you are looking for.

 I work in a survey research firm. the data im talking about has a lot
 of 0-1 variables, meaning yes or no of a lot of questions. so only one
 position of a character is needed (not byte), explaining the 123-123
 kind of positions of a lot of variables.

Then file.seek() is what you are looking for; but, you need to be aware of
line endings and encodings as indicated.  Make sure that you open the file
using whatever encoding was used when it was generated or you could have
problems with multibyte characters affecting the offsets.
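
A short Python 3 illustration of the multibyte concern (the thread itself targets Python 2, where the details differ). The record below is invented: a five-character string whose fourth character needs two bytes in UTF-8, so byte offsets and character offsets disagree from that point on.

```python
record = "Jos\u00e91"            # "Jos" + e-acute + "1": the answer "1" is the 5th character
raw = record.encode("utf-8")     # the accented letter takes two bytes here

by_byte = raw[4:5]                   # 5th byte: second half of the accented letter
by_char = raw.decode("utf-8")[4:5]   # 5th character after decoding

print(len(record), len(raw))   # 5 6
print(by_byte)                 # b'\xa9'
print(by_char)                 # 1

# In a single-byte encoding such as Latin-1 the two views agree.
print(record.encode("latin-1")[4:5])   # b'1'
```

Decoding first and then slicing by character keeps the column numbers from the layout file meaningful regardless of how many bytes each character occupies.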