Multiline regex

2010-07-21 Thread Brandon Harris
I'm trying to read in and parse an ascii type file that contains 
information that can span several lines.

Example:

createNode animCurveTU -n "test:master_globalSmooth";
   setAttr ".tan" 9;
   setAttr -s 4 ".ktv[0:3]"  101 0 163 0 169 0 201 0;
   setAttr -s 4 ".kit[3]"  10;
   setAttr -s 4 ".kot[3]"  10;
createNode animCurveTU -n "test:master_res";
   setAttr ".tan" 9;
   setAttr ".ktv[0]"  103 0;
   setAttr ".kot[0]"  5;
createNode animCurveTU -n "test:master_faceRig";
   setAttr ".tan" 9;
   setAttr ".ktv[0]"  103 0;
   setAttr ".kot[0]"  5;

I'm wanting to grab the information out in chunks, so

createNode animCurveTU -n "test:master_faceRig";
   setAttr ".tan" 9;
   setAttr ".ktv[0]"  103 0;
   setAttr ".kot[0]"  5;

would be what my regex would grab.
I'm currently only able to grab out the first line and part of the 
second line, but no more.

regex is as follows

my_regexp = re.compile("createNode\ animCurve.*\n[\t*setAttr.*\n]*")

I've run several variations of this, but none return me all of the 
expected information.


Is there something special that needs to be done to have the regexp grab 
any number of the setAttr lines without specification?


Brandon L. Harris


--
http://mail.python.org/mailman/listinfo/python-list


Re: Multiline regex

2010-07-21 Thread Rodrick Brown
Slurp the entire file into a string and pick out the fields you need.
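
"Slurping" usually means reading the whole file into one string so that a single regex can scan across line boundaries. A minimal Python 3 sketch, assuming indented setAttr lines follow each createNode line; `read_blocks`, `BLOCK_RE` and the temporary sample file are illustrative names, not anything from the thread:

```python
import os
import re
import tempfile

# One "createNode animCurve" line followed by any number of indented setAttr lines.
BLOCK_RE = re.compile(r"createNode animCurve.*\n(?:[ \t]+setAttr.*\n)*")


def read_blocks(path):
    """Read the whole file into memory and return every animCurve block."""
    with open(path) as handle:
        contents = handle.read()  # the "slurp": one string, newlines included
    return [match.group(0) for match in BLOCK_RE.finditer(contents)]


# Hypothetical sample data, written to a temporary file for the demonstration.
sample = (
    'createNode animCurveTU -n "a";\n'
    '\tsetAttr ".tan" 9;\n'
    'createNode transform -n "b";\n'
    '\tsetAttr ".v" no;\n'
    'createNode animCurveTU -n "c";\n'
    '\tsetAttr ".tan" 9;\n'
    '\tsetAttr ".kot[0]"  5;\n'
)
with tempfile.NamedTemporaryFile("w", suffix=".ma", delete=False) as tmp:
    tmp.write(sample)

blocks = read_blocks(tmp.name)
os.remove(tmp.name)
print(len(blocks))  # number of animCurve blocks found
```

The trade-off is memory: the whole file has to fit in RAM as one string, which may matter for files with millions of lines.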



Re: Multiline regex

2010-07-21 Thread Brandon Harris

What do you mean by slurp the entire file?
I'm trying to use regular expressions because line-by-line parsing will 
be too slow. An example file would have somewhere in the realm of 6 
million lines of code.


Brandon L. Harris



Re: Multiline regex

2010-07-21 Thread Eknath Venkataramani
On Wed, Jul 21, 2010 at 8:12 PM, Brandon Harris
wrote:

> I'm trying to read in and parse an ascii type file that contains
> information that can span several lines.
>
Do you have to use only regex? If not, I'd certainly suggest 'pyparsing'.
It's a  pleasure to use and very easy on the eye too, if you know what I
mean.

>  I'm wanting to grab the information out in chunks, so
>

-- 
Eknath Venkataramani


Re: Multiline regex

2010-07-21 Thread Brandon Harris
At the moment I'm trying to stick with built in python modules to create 
tools for a much larger pipeline on multiple OSes.


Brandon L. Harris




RE: Multiline regex

2010-07-21 Thread Andreas Tawn
> I'm trying to read in and parse an ascii type file that contains
> information that can span several lines.
> Example:
> 
> createNode animCurveTU -n "test:master_globalSmooth";
> setAttr ".tan" 9;
> setAttr -s 4 ".ktv[0:3]"  101 0 163 0 169 0 201 0;
> setAttr -s 4 ".kit[3]"  10;
> setAttr -s 4 ".kot[3]"  10;
> createNode animCurveTU -n "test:master_res";
> setAttr ".tan" 9;
> setAttr ".ktv[0]"  103 0;
> setAttr ".kot[0]"  5;
> createNode animCurveTU -n "test:master_faceRig";
> setAttr ".tan" 9;
> setAttr ".ktv[0]"  103 0;
> setAttr ".kot[0]"  5;
> 
> I'm wanting to grab the information out in chunks, so
> 
> createNode animCurveTU -n "test:master_faceRig";
> setAttr ".tan" 9;
> setAttr ".ktv[0]"  103 0;
> setAttr ".kot[0]"  5;
> 
> would be what my regex would grab.
> I'm currently only able to grab out the first line and part of the
> second line, but no more.
> regex is as follows
> 
> my_regexp = re.compile("createNode\ animCurve.*\n[\t*setAttr.*\n]*")
> 
> I've run several variations of this, but none return me all of the
> expected information.
> 
> Is there something special that needs to be done to have the regexp
> grab
> any number of the setAttr lines without specification?
> 
> Brandon L. Harris

Aren't you making life too complicated for yourself?

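blocks = []
currentBlock = []  # must exist before the first append
for line in yourFile:
    if line.startswith("createNode"):
        # a new block begins: store the previous one, if any
        if currentBlock:
            blocks.append(currentBlock)
        currentBlock = [line]
    else:
        currentBlock.append(line)
if currentBlock:
    blocks.append(currentBlock)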

Cheers,

Drea


Re: Multiline regex

2010-07-21 Thread Peter Otten
Brandon Harris wrote:

> I'm trying to read in and parse an ascii type file that contains
> information that can span several lines.
> Example:
> 
> createNode animCurveTU -n "test:master_globalSmooth";
> setAttr ".tan" 9;
> setAttr -s 4 ".ktv[0:3]"  101 0 163 0 169 0 201 0;
> setAttr -s 4 ".kit[3]"  10;
> setAttr -s 4 ".kot[3]"  10;
> createNode animCurveTU -n "test:master_res";
> setAttr ".tan" 9;
> setAttr ".ktv[0]"  103 0;
> setAttr ".kot[0]"  5;
> createNode animCurveTU -n "test:master_faceRig";
> setAttr ".tan" 9;
> setAttr ".ktv[0]"  103 0;
> setAttr ".kot[0]"  5;
> 
> I'm wanting to grab the information out in chunks, so
> 
> createNode animCurveTU -n "test:master_faceRig";
> setAttr ".tan" 9;
> setAttr ".ktv[0]"  103 0;
> setAttr ".kot[0]"  5;
> 
> would be what my regex would grab.
> I'm currently only able to grab out the first line and part of the
> second line, but no more.
> regex is as follows
> 
> my_regexp = re.compile("createNode\ animCurve.*\n[\t*setAttr.*\n]*")
> 
> I've run several variations of this, but none return me all of the
> expected information.
> 
> Is there something special that needs to be done to have the regexp grab
> any number of the setAttr lines without specification?

Groups are marked with parens (...) not brackets [...].

>>> text = """\
... createNode animCurveTU -n "test:master_globalSmooth";
... setAttr ".tan" 9;
... setAttr -s 4 ".ktv[0:3]"  101 0 163 0 169 0 201 0;
... setAttr -s 4 ".kit[3]"  10;
... setAttr -s 4 ".kot[3]"  10;
... createNode animCurveTU -n "test:master_res";
... setAttr ".tan" 9;
... setAttr ".ktv[0]"  103 0;
... setAttr ".kot[0]"  5;
... createNode animCurveTU -n "test:master_faceRig";
... setAttr ".tan" 9;
... setAttr ".ktv[0]"  103 0;
... setAttr ".kot[0]"  5;
... """
>>> import re
>>> for m in re.compile(r"(createNode animCurve.*\n(\s*setAttr.*\n)*)").finditer(text):
...     print m.group(1)
...     print "-" * 40
...
createNode animCurveTU -n "test:master_globalSmooth";
setAttr ".tan" 9;
setAttr -s 4 ".ktv[0:3]"  101 0 163 0 169 0 201 0;
setAttr -s 4 ".kit[3]"  10;
setAttr -s 4 ".kot[3]"  10;

----------------------------------------
createNode animCurveTU -n "test:master_res";
setAttr ".tan" 9;
setAttr ".ktv[0]"  103 0;
setAttr ".kot[0]"  5;

----------------------------------------
createNode animCurveTU -n "test:master_faceRig";
setAttr ".tan" 9;
setAttr ".ktv[0]"  103 0;
setAttr ".kot[0]"  5;

----------------------------------------

Peter
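
The brackets-versus-parentheses point can be checked directly. The sketch below (Python 3, with a shortened, hypothetical sample) compares the original pattern, where `[...]` is a character class matching one character at a time from the set inside it, with a grouped version; `bracket_re` and `group_re` are names chosen for illustration:

```python
import re

text = (
    'createNode animCurveTU -n "test:master_res";\n'
    '\tsetAttr ".tan" 9;\n'
    '\tsetAttr ".ktv[0]"  103 0;\n'
    'createNode animCurveTU -n "test:master_faceRig";\n'
    '\tsetAttr ".tan" 9;\n'
)

# [...] is a character class: it repeats single characters drawn from
# tab, *, s, e, t, A, r, . and newline, so it stops at the first space.
bracket_re = re.compile(r"createNode animCurve.*\n[\t*setAttr.*\n]*")

# (?:...) is a (non-capturing) group: the whole "setAttr line" is repeated.
group_re = re.compile(r"createNode animCurve.*\n(?:\s*setAttr.*\n)*")

bracket_matches = bracket_re.findall(text)
group_matches = group_re.findall(text)

print(repr(bracket_matches[0]))  # first line plus a fragment of the second
print(repr(group_matches[0]))    # first line plus both setAttr lines
```

The character-class version stops right after the word setAttr on the second line, which matches the "first line and part of the second line" symptom described in the question.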



Re: Multiline regex

2010-07-21 Thread Brandon Harris
Could it be that there isn't just that type of data in the file? There 
are many different types; that is just one that I'm trying to grab.


Brandon L. Harris




RE: Multiline regex

2010-07-21 Thread Andreas Tawn
>>> I could make it that simple, but that is also incredibly slow and on
>>> a file with several million lines, it takes somewhere in the league of
>>> half an hour to grab all the data. I need this to grab data from
>>> many many file and return the data quickly.
>>>
>>> Brandon L. Harris
>>>
>> That's surprising.
>>
>> I just made a file with 13 million lines of your data (447Mb) and
>> read it with my code. It took a little over 36 seconds. There must be
>> something different in your set up or the real data you've got.
>>
>> Cheers,
>>
>> Drea
>>
> Could it be that there isn't just that type of data in the file? there
> are many different types, that is just one that I'm trying to grab.
> 
> Brandon L. Harris

I don't see why it would make such a difference.

If your data looks like...

<block header line>
\t<block content line>
\t<block content line>
\t<block content line>

Just change this line...

if line.startswith("createNode"):

to...

if not line.startswith("\t"):

and it won't care what sort of data the file contains.

Processing that data after you've collected it will still take a while, but 
that's the same whichever method you use to read it.
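
As a rough illustration of that one-line change, here is a generator version in Python 3 that starts a new block at every non-indented line. The name `group_blocks` and the sample lines are hypothetical, and the sketch assumes that continuation lines begin with a tab or a space:

```python
def group_blocks(lines):
    """Yield lists of lines; a new block starts at each non-indented line."""
    current = []
    for line in lines:
        # A non-indented line closes the block collected so far.
        if not line.startswith(("\t", " ")) and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


# Hypothetical input mixing two kinds of top-level statements.
sample = [
    'createNode animCurveTU -n "a";\n',
    '\tsetAttr ".tan" 9;\n',
    'select -ne :time1;\n',
    '\tsetAttr ".o" 1;\n',
    '\tsetAttr ".unw" 1;\n',
]
blocks = list(group_blocks(sample))
print([len(block) for block in blocks])
```

Because it is a generator, only one block is held in memory at a time, and it can be fed an open file object directly.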

Cheers,

Drea

p.s. The currentBlock list needs to be declared (currentBlock = []) before the loop; my first post left that out.


Re: Multiline regex

2010-07-21 Thread Jeremy Sanders
Brandon Harris wrote:

> I'm trying to read in and parse an ascii type file that contains
> information that can span several lines.
> Example:

What about something like this (you need re.MULTILINE):

In [16]: re.findall('^([^ ].*\n([ ].*\n)+)', a, re.MULTILINE)
Out[16]: 
[('createNode animCurveTU -n "test:master_globalSmooth";\nsetAttr ".tan" 
9;\nsetAttr -s 4 ".ktv[0:3]"  101 0 163 0 169 0 201 0;\nsetAttr -s 4 
".kit[3]"  10;\nsetAttr -s 4 ".kot[3]"  10;\n',
  'setAttr -s 4 ".kot[3]"  10;\n'),
 ('createNode animCurveTU -n "test:master_res";\nsetAttr ".tan" 9;\n
setAttr ".ktv[0]"  103 0;\nsetAttr ".kot[0]"  5;\n',
  'setAttr ".kot[0]"  5;\n'),
 ('createNode animCurveTU -n "test:master_faceRig";\nsetAttr ".tan" 9;\n
setAttr ".ktv[0]"  103 0;\n',
  'setAttr ".ktv[0]"  103 0;\n')]

This works if your blocks start without a space and subsequent lines start with a space.
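
A variation on the same idea, sketched for Python 3: making the inner group non-capturing lets findall return plain strings instead of tuples. The sample string `a` below is a shortened, hypothetical stand-in for the real data, with one leading space on continuation lines:

```python
import re

a = (
    'createNode animCurveTU -n "test:master_res";\n'
    ' setAttr ".tan" 9;\n'
    ' setAttr ".kot[0]"  5;\n'
    'createNode animCurveTU -n "test:master_faceRig";\n'
    ' setAttr ".tan" 9;\n'
    ' setAttr ".kot[0]"  5;\n'
)

# With re.MULTILINE, ^ matches at the start of every line, not just the string.
block_re = re.compile(r"^[^ \n].*\n(?: .*\n)+", re.MULTILINE)
blocks = block_re.findall(a)

# The capturing version returns (whole block, last repeated line) tuples.
tuples = re.findall(r"^([^ ].*\n([ ].*\n)+)", a, re.MULTILINE)

# Every line must end in a newline to be matched; without the final
# newline the last line of the last block is silently dropped.
truncated = block_re.findall(a.rstrip("\n"))

print(len(blocks), len(tuples), truncated[1].count("setAttr"))
```

The missing-newline case may be why the last block in the output above stops one line short, although that is only a guess about the input that was used.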

Jeremy




Re: Multiline regex

2010-07-21 Thread Steven D'Aprano
On Wed, 21 Jul 2010 10:06:14 -0500, Brandon Harris wrote:

> what do you mean by slurp the entire file? I'm trying to use regular
> expressions because line by line parsing will be too slow.  And example
> file would have somewhere in the realm of 6 million lines of code.

And you think trying to run a regex over all 6 million lines at once will 
be faster? I think you're going to be horribly, horribly disappointed.


And then on Wed, 21 Jul 2010 10:42:11 -0500, Brandon Harris wrote:

> I could make it that simple, but that is also incredibly slow and on a
> file with several million lines, it takes somewhere in the league of
> half an hour to grab all the data. I need this to grab data from many
> many file and return the data quickly.

What do you mean "grab" all the data? If all you mean is read the file, 
then 30 minutes to read ~ 100MB of data is incredibly slow and you're 
probably doing something wrong, or you're reading it over a broken link 
with very high packet loss, or something.

If you mean read the data AND parse it, then whether that is "incredibly 
slow" or "amazingly fast" depends entirely on how complicated your parser 
needs to be.

If *all* you mean is "read the file and group the lines, for later 
processing", then I would expect it to take well under a minute to group 
millions of lines. Here's a simulation I ran, using 2001000 lines of text 
based on the examples you gave. It grabs the blocks, as required, but 
does no further parsing of them.


def merge(lines):
    """Join multiple lines into a single block."""
    accumulator = []
    for line in lines:
        if line.lower().startswith('createnode'):
            if accumulator:
                yield ''.join(accumulator)
                accumulator = []
        accumulator.append(line)
    if accumulator:
        yield ''.join(accumulator)


def test():
    import time
    t = time.time()
    count = 0
    f = open('/steve/test.junk')
    for block in merge(f):
        # do some make-work
        n = sum([1 for c in block if c in '1234567890'])
        count += 1
    print "Processed %d blocks from 2M+ lines." % count
    print "Time taken:", time.time() - t, "seconds"


And the result on a low-end PC:

>>> test()
Processed 1000 blocks from 2M+ lines.
Time taken: 17.4497909546 seconds



-- 
Steven


Multiline regex help

2005-03-03 Thread Yatima
Hey Folks,

I've got some info in a bunch of files that kind of looks like so:

Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34

and so on...

Anyhow, these "fields" repeat several times in a given file (number of
repetitions varies from file to file). The number on the line following the
"RelevantInfo" lines is really what I'm after. Ideally, I would like to have
something like so:

RelevantInfo1 = 10/10/04 # The variable name isn't actually important
RelevantInfo3 = 23   # it's just there to illustrate what info I'm
 # trying to snag.

Score[RelevantInfo1][RelevantInfo3] = 22 # The value from RelevantInfo2

Collected from all of the files.

So, there would be several of these "scores" per file and there are a bunch
of files. Ultimately, I am interested in printing them out as a csv file but
that should be relatively easy once they are trapped in my array of doom.

I've got a fairly ugly "solution" (I am using this term *very* loosely)
using awk and his faithful companion sed, but I would prefer something in
python.

Thanks for your time.

-- 
McGowan's Madison Avenue Axiom:
If an item is advertised as "under $50", you can bet it's not $19.95.


Re: RE: Multiline regex

2010-07-21 Thread Brandon Harris
I could make it that simple, but that is also incredibly slow and on a 
file with several million lines, it takes somewhere in the league of 
half an hour to grab all the data. I need this to grab data from many 
many file and return the data quickly.


Brandon L. Harris




RE: RE: Multiline regex

2010-07-21 Thread Andreas Tawn
> I could make it that simple, but that is also incredibly slow and on a
> file with several million lines, it takes somewhere in the league of
> half an hour to grab all the data. I need this to grab data from many
> many file and return the data quickly.
> 
> Brandon L. Harris

That's surprising.

I just made a file with 13 million lines of your data (447Mb) and read it with 
my code. It took a little over 36 seconds. There must be something different in 
your set up or the real data you've got.

Cheers,

Drea


Re: Multiline regex help

2005-03-03 Thread Kent Johnson
Yatima wrote:
> Hey Folks,
> I've got some info in a bunch of files that kind of looks like so:
>
> Gibberish
> 53
> MoreGarbage
> 12
> RelevantInfo1
> 10/10/04
> NothingImportant
> ThisDoesNotMatter
> 44
> RelevantInfo2
> 22
> BlahBlah
> 343
> RelevantInfo3
> 23
> Hubris
> Crap
> 34
>
> and so on...
>
> Anyhow, these "fields" repeat several times in a given file (number of
> repetitions varies from file to file). The number on the line following the
> "RelevantInfo" lines is really what I'm after. Ideally, I would like to have
> something like so:
>
> RelevantInfo1 = 10/10/04 # The variable name isn't actually important
> RelevantInfo3 = 23       # it's just there to illustrate what info I'm
>                          # trying to snag.

Here is a way to create a list of [RelevantInfo, value] pairs:
import cStringIO
raw_data = '''Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34'''
raw_data = cStringIO.StringIO(raw_data)
data = []
for line in raw_data:
    if line.startswith('RelevantInfo'):
        key = line.strip()
        value = raw_data.next().strip()
        data.append([key, value])
print data
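
The loop above is Python 2 (cStringIO, the .next() method, the print statement). A rough Python 3 equivalent, assuming the same marker-then-value layout and using a shortened sample, might look like this:

```python
import io

# Shortened, hypothetical sample in the same marker-then-value layout.
raw_data = io.StringIO(
    "Gibberish\n53\n"
    "RelevantInfo1\n10/10/04\n"
    "NothingImportant\n"
    "RelevantInfo2\n22\n"
    "RelevantInfo3\n23\n"
    "Crap\n34\n"
)

data = []
for line in raw_data:
    if line.startswith("RelevantInfo"):
        key = line.strip()
        # next() pulls the following line from the same iterator, so the
        # for loop resumes after the value; "" covers a marker on the last line.
        value = next(raw_data, "").strip()
        data.append([key, value])

print(data)
```

The same loop works on an open file object, since files are also their own iterators.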

> Score[RelevantInfo1][RelevantInfo3] = 22 # The value from RelevantInfo2

I'm not sure what you mean by this. Do you want to build a Score dictionary 
as well?

Kent


Re: Multiline regex help

2005-03-03 Thread Steven Bethard
Yatima wrote:
> Hey Folks,
> I've got some info in a bunch of files that kind of looks like so:
>
> Gibberish
> 53
> MoreGarbage
> 12
> RelevantInfo1
> 10/10/04
> NothingImportant
> ThisDoesNotMatter
> 44
> RelevantInfo2
> 22
> BlahBlah
> 343
> RelevantInfo3
> 23
> Hubris
> Crap
> 34
>
> and so on...
>
> Anyhow, these "fields" repeat several times in a given file (number of
> repetitions varies from file to file). The number on the line following the
> "RelevantInfo" lines is really what I'm after. Ideally, I would like to have
> something like so:
>
> RelevantInfo1 = 10/10/04 # The variable name isn't actually important
> RelevantInfo3 = 23       # it's just there to illustrate what info I'm
>                          # trying to snag.
>
> Score[RelevantInfo1][RelevantInfo3] = 22 # The value from RelevantInfo2

A possible solution, using the re module:
py> s = """\
... Gibberish
... 53
... MoreGarbage
... 12
... RelevantInfo1
... 10/10/04
... NothingImportant
... ThisDoesNotMatter
... 44
... RelevantInfo2
... 22
... BlahBlah
... 343
... RelevantInfo3
... 23
... Hubris
... Crap
... 34
... """
py> import re
py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
...                    .*
...                    ^RelevantInfo2\n([^\n]*)
...                    .*
...                    ^RelevantInfo3\n([^\n]*)""",
...                re.DOTALL | re.MULTILINE | re.VERBOSE)
py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {})[info3] = info2
...
py> score
{'10/10/04': {'23': '22'}}
Note that I use DOTALL to allow .* to cross line boundaries, MULTILINE 
to have ^ apply at the start of each line, and VERBOSE to allow me to 
write the re in a more readable form.

If I didn't get your dict update quite right, hopefully you can see how 
to fix it!
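
To see what each flag contributes, the following Python 3 sketch runs one pattern with different flag combinations on a small, hypothetical two-field sample. The variable names are illustrative only:

```python
import re

s = "junk\nRelevantInfo1\n10/10/04\nfiller\nRelevantInfo2\n22\n"
pattern = r"^RelevantInfo1\n([^\n]*).*^RelevantInfo2\n([^\n]*)"

# No flags: ^ only matches at the very start of the string ("junk").
no_flags = re.search(pattern, s)

# MULTILINE alone: ^ matches at each line start, but . cannot cross a
# newline, so .* never reaches the RelevantInfo2 line.
multiline_only = re.search(pattern, s, re.MULTILINE)

# MULTILINE + DOTALL: .* may now span the filler lines in between.
both = re.search(pattern, s, re.MULTILINE | re.DOTALL)

# VERBOSE: whitespace and # comments inside the pattern are ignored.
verbose = re.compile(
    r"""
    ^RelevantInfo1\n([^\n]*)   # value on the line after the first marker
    .*                         # anything in between, newlines included
    ^RelevantInfo2\n([^\n]*)   # value on the line after the second marker
    """,
    re.DOTALL | re.MULTILINE | re.VERBOSE,
).search(s)

print(no_flags, multiline_only, both.groups(), verbose.groups())
```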

HTH,
STeVe


Re: Multiline regex help

2005-03-03 Thread Yatima
On Thu, 03 Mar 2005 09:54:02 -0700, Steven Bethard <[EMAIL PROTECTED]> wrote:
>
> A possible solution, using the re module:
>
> py> s = """\
> ... Gibberish
> ... 53
> ... MoreGarbage
> ... 12
> ... RelevantInfo1
> ... 10/10/04
> ... NothingImportant
> ... ThisDoesNotMatter
> ... 44
> ... RelevantInfo2
> ... 22
> ... BlahBlah
> ... 343
> ... RelevantInfo3
> ... 23
> ... Hubris
> ... Crap
> ... 34
> ... """
> py> import re
> py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
> ...                    .*
> ...                    ^RelevantInfo2\n([^\n]*)
> ...                    .*
> ...                    ^RelevantInfo3\n([^\n]*)""",
> ...                re.DOTALL | re.MULTILINE | re.VERBOSE)
> py> score = {}
> py> for info1, info2, info3 in m.findall(s):
> ...     score.setdefault(info1, {})[info3] = info2
> ...
> py> score
> {'10/10/04': {'23': '22'}}
>
> Note that I use DOTALL to allow .* to cross line boundaries, MULTILINE 
> to have ^ apply at the start of each line, and VERBOSE to allow me to 
> write the re in a more readable form.
>
> If I didn't get your dict update quite right, hopefully you can see how 
> to fix it!

Thanks! That was very helpful. Unfortunately, I wasn't completely clear when
describing the problem. Is there any way to extract multiple scores from the
same file and from multiple files? (I will probably use the "fileinput"
module to deal with multiple files.) So, if I've got, say:

Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34

SecondSetofGarbage
2423
YouGetThePicture
342342
RelevantInfo1
10/10/04
HoHum
343
MoreStuffNotNeeded
232
RelevantInfo2
33
RelevantInfo3
44
sdfsdf
RelevantInfo1
10/11/04
InsertBoringFillerHere
43234
Stuff
MoreStuff
RelevantInfo2
45
ExcitingIsntIt
324234
RelevantInfo3
60
Lalala

Sorry for the long and painful example input. Notice that the first two
"RelevantInfo1" fields have the same info but that the RelevantInfo2 and
RelevantInfo3 fields have different info. Also, there will be cases where
RelevantInfo3 might be the same with a different RelevantInfo2. What I'm
hoping for is something along the lines of being able to organize it like
so (don't worry about the format of the output -- I'll deal with that
later; "RelevantInfo" shortened to "Info" for readability):

            Info1[0],                   Info[1],                    Info[2] ...
Info3[0]    Info2[Info1[0],Info3[0]]    Info2[Info1[1],Info3[1]]    ...
Info3[1]    Info2[Info1[0],Info3[1]]    ...
Info3[2]    Info2[Info1[0],Info3[2]]    ...
...

I don't really care if it's a list, dictionary, array etc. 

Thanks again for your help. The multiline option in the re module is very
useful. 

Take care.

-- 
Clarke's Conclusion:
Never let your sense of morals interfere with doing the right thing.


Re: Multiline regex help

2005-03-03 Thread James Stroud
Have a look at "martel", part of biopython. The world of bioinformatics is 
filled with files with structure like this.

http://www.biopython.org/docs/api/public/Martel-module.html

James

-- 
James Stroud, Ph.D.
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095


Re: Multiline regex help

2005-03-03 Thread Yatima
On Thu, 03 Mar 2005 07:14:50 -0500, Kent Johnson <[EMAIL PROTECTED]> wrote:
>
> Here is a way to create a list of [RelevantInfo, value] pairs:
> import cStringIO
>
> raw_data = '''Gibberish
> 53
> MoreGarbage
> 12
> RelevantInfo1
> 10/10/04
> NothingImportant
> ThisDoesNotMatter
> 44
> RelevantInfo2
> 22
> BlahBlah
> 343
> RelevantInfo3
> 23
> Hubris
> Crap
> 34'''
> raw_data = cStringIO.StringIO(raw_data)
>
> data = []
> for line in raw_data:
>     if line.startswith('RelevantInfo'):
>         key = line.strip()
>         value = raw_data.next().strip()
>         data.append([key, value])
>
> print data
>

Thank you. This isn't exactly what I'm looking for (I wasn't clear in
describing the problem -- please see my reply to Steve for a, hopefully,
better explanation) but it does give me a few ideas.
>
>> 
>> Score[RelevantInfo1][RelevantInfo3] = 22 # The value from RelevantInfo2
>
> I'm not sure what you mean by this. Do you want to build a Score dictionary 
> as well?

Sure... Uhhh.. I think. Okay, what I want is some kind of awk-like
associative array because the raw data files will have repeats for certain
field values such that there would be, for example, multiple RelevantInfo2's
and RelevantInfo3's for the same RelevantInfo1 (i.e. on the same date). To
make matters more exciting, there will be multiple RelevantInfo1's (dates)
for the same RelevantInfo3 (e.g. a subject ID). RelevantInfo2 will be the
value for all unique combinations of RelevantInfo1 and RelevantInfo3. There
will be multiple occurrences of these fields in the same file (original data
sample was not very good for this reason) and multiple files as well. The
interesting three fields will always be repeated in the same order although
the amount of irrelevant data in between may vary. So:

RelevantInfo1
10/10/04

RelevantInfo2
12

RelevantInfo3
43

RelevantInfo1
10/10/04    <- The same as the first occurrence of RelevantInfo1

RelevantInfo2
22

RelevantInfo3
25

RelevantInfo1
10/11/04

RelevantInfo2
34

RelevantInfo3
28

RelevantInfo1
10/12/04

RelevantInfo2
98

RelevantInfo3
25    <- The same as the second occurrence of RelevantInfo3
...

Sorry for the long and tedious "data" example.

There will be missing values for some combinations of RelevantInfo1 and
RelevantInfo3 so hopefully that won't be an issue.
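
One way such an awk-like associative array could be sketched in Python 3 is a dict keyed by (RelevantInfo1, RelevantInfo3) tuples, written out as a CSV grid with blanks for missing combinations. The function names `collect_scores` and `to_csv` are hypothetical, and the sketch assumes the three fields always appear in the order 1, 2, 3, as described above:

```python
import csv
import io

FIELDS = ("RelevantInfo1", "RelevantInfo2", "RelevantInfo3")


def collect_scores(lines):
    """Map (info1, info3) -> info2; assumes the fields arrive in order 1, 2, 3."""
    scores = {}
    record = {}
    lines = iter(lines)
    for line in lines:
        name = line.strip()
        if name in FIELDS:
            record[name] = next(lines, "").strip()  # value is on the next line
            if name == "RelevantInfo3":             # last field closes a record
                key = (record["RelevantInfo1"], record["RelevantInfo3"])
                scores[key] = record["RelevantInfo2"]
                record = {}
    return scores


def to_csv(scores):
    """One column per info1 value, one row per info3 value, blank if missing."""
    dates = sorted({d for d, _ in scores})
    subjects = sorted({s for _, s in scores})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Info3"] + dates)
    for subject in subjects:
        writer.writerow([subject] + [scores.get((d, subject), "") for d in dates])
    return out.getvalue()


sample = """RelevantInfo1
10/10/04

RelevantInfo2
12

RelevantInfo3
43

RelevantInfo1
10/10/04
RelevantInfo2
22
RelevantInfo3
25
RelevantInfo1
10/11/04
RelevantInfo2
34
RelevantInfo3
28
RelevantInfo1
10/12/04
RelevantInfo2
98
RelevantInfo3
25
"""
scores = collect_scores(sample.splitlines())
table = to_csv(scores)
print(table)
```

For several files, the same function could be fed the lines of each file in turn and the resulting dicts merged; the date and ID columns are sorted as plain strings here, which is another simplification.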

Thanks again for your reply.

Take care.

-- 
"I figured there was this holocaust, right, and the only ones left alive were
 Donna Reed, Ozzie and Harriet, and the Cleavers."
-- Wil Wheaton explains why everyone in "Star Trek: The Next Generation" 
is so nice


Re: Multiline regex help

2005-03-03 Thread James Stroud
I found the original paper for Martel:

http://www.dalkescientific.com/Martel/ipc9/


-- 
James Stroud, Ph.D.
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095


Re: Multiline regex help

2005-03-03 Thread Steven Bethard
Yatima wrote:
> On Thu, 03 Mar 2005 09:54:02 -0700, Steven Bethard <[EMAIL PROTECTED]> wrote:
>> A possible solution, using the re module:
>>
>> py> s = """\
>> ... Gibberish
>> ... 53
>> ... MoreGarbage
>> ... 12
>> ... RelevantInfo1
>> ... 10/10/04
>> ... NothingImportant
>> ... ThisDoesNotMatter
>> ... 44
>> ... RelevantInfo2
>> ... 22
>> ... BlahBlah
>> ... 343
>> ... RelevantInfo3
>> ... 23
>> ... Hubris
>> ... Crap
>> ... 34
>> ... """
>> py> import re
>> py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
>> ...                    .*
>> ...                    ^RelevantInfo2\n([^\n]*)
>> ...                    .*
>> ...                    ^RelevantInfo3\n([^\n]*)""",
>> ...                re.DOTALL | re.MULTILINE | re.VERBOSE)
>> py> score = {}
>> py> for info1, info2, info3 in m.findall(s):
>> ...     score.setdefault(info1, {})[info3] = info2
>> ...
>> py> score
>> {'10/10/04': {'23': '22'}}
>>
>> Note that I use DOTALL to allow .* to cross line boundaries, MULTILINE
>> to have ^ apply at the start of each line, and VERBOSE to allow me to
>> write the re in a more readable form.
>>
>> If I didn't get your dict update quite right, hopefully you can see how
>> to fix it!
>
> Thanks! That was very helpful. Unfortunately, I wasn't completely clear when
> describing the problem. Is there any way to extract multiple scores from the
> same file and from multiple files?

I think if you use the non-greedy .*? instead of the greedy .*, you'll 
get this behavior.  For example:

py> s = """\
... Gibberish
... 53
... MoreGarbage
[snip a whole bunch of stuff]
... RelevantInfo3
... 60
... Lalala
... """
py> import re
py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
...                    .*?
...                    ^RelevantInfo2\n([^\n]*)
...                    .*?
...                    ^RelevantInfo3\n([^\n]*)""",
...                re.DOTALL | re.MULTILINE | re.VERBOSE)
py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {})[info3] = info2
...
py> score
{'10/10/04': {'44': '33', '23': '22'}, '10/11/04': {'60': '45'}}

If you might have multiple info2 values for the same (info1, info3)
pair, you can try something like:

py> score = {}
py> for info1, info2, info3 in m.findall(s):
...     score.setdefault(info1, {}).setdefault(info3, []).append(info2)
...
py> score
{'10/10/04': {'44': ['33'], '23': ['22']}, '10/11/04': {'60': ['45']}}
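
For readers who want to run the comparison, here is a self-contained sketch of the greedy versus non-greedy difference. The sample text and the names `SAMPLE`, `build_pattern` and `collect` are made up for illustration; only the regular expression and the setdefault idiom come from the session above. With greedy `.*` the pattern matches once, from the first RelevantInfo1 through to the last RelevantInfo2 and RelevantInfo3, while `.*?` stops at the nearest following markers and so yields one match per record.

```python
import re

# Made-up sample: two records with filler lines in between.
SAMPLE = """\
Gibberish
RelevantInfo1
10/10/04
Filler
RelevantInfo2
22
RelevantInfo3
23
MoreFiller
RelevantInfo1
10/11/04
RelevantInfo2
45
Filler
RelevantInfo3
60
Lalala
"""

def build_pattern(gap):
    # `gap` is the sub-pattern allowed between fields: ".*" or ".*?".
    # VERBOSE ignores the layout whitespace, MULTILINE lets ^ match at the
    # start of every line, DOTALL lets "." match newlines.
    return re.compile(r"""^RelevantInfo1\n([^\n]*)
                          %s
                          ^RelevantInfo2\n([^\n]*)
                          %s
                          ^RelevantInfo3\n([^\n]*)""" % (gap, gap),
                      re.DOTALL | re.MULTILINE | re.VERBOSE)

def collect(pattern, text):
    # info1 -> info3 -> list of info2 values, as in the session above.
    score = {}
    for info1, info2, info3 in pattern.findall(text):
        score.setdefault(info1, {}).setdefault(info3, []).append(info2)
    return score

greedy = collect(build_pattern(".*"), SAMPLE)
lazy = collect(build_pattern(".*?"), SAMPLE)
print(greedy)
print(lazy)
```

One caveat, stated as an observation rather than a fix: if a record lacked one of the three markers, `.*?` would run on into the next record and pair values from different records. For several files, the same `collect` loop could presumably be run over each file's contents, updating one shared dict.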
HTH,
STeVe


Re: Multiline regex help

2005-03-03 Thread Kent Johnson
Here is another attempt. I'm still not sure I understand what form you want the data in. I made a
dict -> dict -> list structure so if you look up e.g. scores['10/11/04']['60'] you get a list of all
the RelevantInfo2 values for RelevantInfo1='10/11/04' and RelevantInfo3='60'.

The parser is a simple-minded state machine that will misbehave if the input does not have entries
in the order RelevantInfo1, RelevantInfo2, RelevantInfo3 (with as many intervening lines as you like).

All three values are available when RelevantInfo3 is detected so you could do something else with them
if you want.

HTH
Kent

import cStringIO

raw_data = '''Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34
Gibberish
53
MoreGarbage
12
RelevantInfo1
10/10/04
NothingImportant
ThisDoesNotMatter
44
RelevantInfo2
22
BlahBlah
343
RelevantInfo3
23
Hubris
Crap
34
SecondSetofGarbage
2423
YouGetThePicture
342342
RelevantInfo1
10/10/04
HoHum
343
MoreStuffNotNeeded
232
RelevantInfo2
33
RelevantInfo3
44
sdfsdf
RelevantInfo1
10/11/04
InsertBoringFillerHere
43234
Stuff
MoreStuff
RelevantInfo2
45
ExcitingIsntIt
324234
RelevantInfo3
60
Lalala'''
raw_data = cStringIO.StringIO(raw_data)

scores = {}
info1 = info2 = info3 = None

for line in raw_data:
    if line.startswith('RelevantInfo1'):
        info1 = raw_data.next().strip()
    elif line.startswith('RelevantInfo2'):
        info2 = raw_data.next().strip()
    elif line.startswith('RelevantInfo3'):
        info3 = raw_data.next().strip()
        scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
        info1 = info2 = info3 = None

print scores
print scores['10/11/04']['60']
print scores['10/10/04']['23']

## prints:
{'10/10/04': {'44': ['33'], '23': ['22', '22']}, '10/11/04': {'60': ['45']}}
['45']
['22', '22']
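
The code above is Python 2 (cStringIO, the print statement, the .next() method). A rough Python 3 rendering of the same state machine might look like the sketch below. The shortened sample data is made up for illustration, and passing a default to next() is an added assumption so that a marker on the very last line does not raise StopIteration.

```python
import io

# Shortened, made-up sample in the same layout as the data above.
raw_data = io.StringIO("""\
Gibberish
53
RelevantInfo1
10/10/04
NothingImportant
RelevantInfo2
22
BlahBlah
RelevantInfo3
23
Hubris
RelevantInfo1
10/11/04
RelevantInfo2
45
RelevantInfo3
60
Lalala
""")

scores = {}
info1 = info2 = info3 = None

for line in raw_data:
    if line.startswith('RelevantInfo1'):
        # The value sits on the line after the marker; pull it from the
        # same iterator the for loop is consuming.
        info1 = next(raw_data, '').strip()
    elif line.startswith('RelevantInfo2'):
        info2 = next(raw_data, '').strip()
    elif line.startswith('RelevantInfo3'):
        info3 = next(raw_data, '').strip()
        # All three values are known here, so record them and reset.
        scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
        info1 = info2 = info3 = None

print(scores)
```

As with the original, this relies on the three markers always appearing in the same order.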


Re: Multiline regex help

2005-03-03 Thread Yatima
On Thu, 03 Mar 2005 13:45:31 -0700, Steven Bethard <[EMAIL PROTECTED]> wrote:
>
> I think if you use the non-greedy .*? instead of the greedy .*, you'll 
> get this behavior.  For example:
>
> py> s = """\
> ... Gibberish
> ... 53
> ... MoreGarbage
> [snip a whole bunch of stuff]
> ... RelevantInfo3
> ... 60
> ... Lalala
> ... """
> py> import re
> py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
> ...                    .*?
> ...                    ^RelevantInfo2\n([^\n]*)
> ...                    .*?
> ...                    ^RelevantInfo3\n([^\n]*)""",
> ...                re.DOTALL | re.MULTILINE | re.VERBOSE)
> py> score = {}
> py> for info1, info2, info3 in m.findall(s):
> ...     score.setdefault(info1, {})[info3] = info2
> ...
> py> score
> {'10/10/04': {'44': '33', '23': '22'}, '10/11/04': {'60': '45'}}
>
> If you might have multiple info2 values for the same (info1, info3) 
> pair, you can try something like:
>
> py> score = {}
> py> for info1, info2, info3 in m.findall(s):
> ...     score.setdefault(info1, {}).setdefault(info3, []).append(info2)
> ...
> py> score
> {'10/10/04': {'44': ['33'], '23': ['22']}, '10/11/04': {'60': ['45']}}
>
Perfect! Thank you so much. This is the behaviour I'm looking for. I will
fiddle around with this some more tonight but the rest should be okay.

Take care.

-- 
Of course power tools and alcohol don't mix.  Everyone knows power
tools aren't soluble in alcohol...
-- Crazy Nigel


Re: Multiline regex help

2005-03-03 Thread Yatima
On Thu, 03 Mar 2005 16:25:39 -0500, Kent Johnson <[EMAIL PROTECTED]> wrote:
> Here is another attempt. I'm still not sure I understand what form you want
> the data in. I made a dict -> dict -> list structure so if you look up e.g.
> scores['10/11/04']['60'] you get a list of all the RelevantInfo2 values for
> RelevantInfo1='10/11/04' and RelevantInfo3='60'.
>
> The parser is a simple-minded state machine that will misbehave if the input
> does not have entries in the order RelevantInfo1, RelevantInfo2, RelevantInfo3
> (with as many intervening lines as you like).
>
> All three values are available when RelevantInfo3 is detected so you could do
> something else with them if you want.
>
> HTH
> Kent
>
> import cStringIO
>
> raw_data = '''Gibberish
> 53
> MoreGarbage
[mass snippage]
> 60
> Lalala'''
> raw_data = cStringIO.StringIO(raw_data)
>
> scores = {}
> info1 = info2 = info3 = None
>
> for line in raw_data:
>     if line.startswith('RelevantInfo1'):
>         info1 = raw_data.next().strip()
>     elif line.startswith('RelevantInfo2'):
>         info2 = raw_data.next().strip()
>     elif line.startswith('RelevantInfo3'):
>         info3 = raw_data.next().strip()
>         scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
>         info1 = info2 = info3 = None
>
> print scores
> print scores['10/11/04']['60']
> print scores['10/10/04']['23']
>
> ## prints:
> {'10/10/04': {'44': ['33'], '23': ['22', '22']}, '10/11/04': {'60': ['45']}}
> ['45']
> ['22', '22']

Thank you so much. Your solution and Steve's both give me what I'm looking
for. I appreciate both of your incredibly quick replies!

Take care.

-- 
You worry too much about your job.  Stop it.  You are not paid enough to worry.


Re: Multiline regex help

2005-03-03 Thread Yatima
On Thu, 3 Mar 2005 12:26:37 -0800, James Stroud <[EMAIL PROTECTED]> wrote:
> Have a look at "martel", part of biopython. The world of bioinformatics is 
> filled with files with structure like this.
>
> http://www.biopython.org/docs/api/public/Martel-module.html
>
> James

Thanks for the link. Steve and Kent have provided me with nice solutions but
I will check this out anyway for future reference.

Take care.

-- 
You may easily play a joke on a man who likes to argue -- agree with him.
-- Ed Howe


Re: Multiline regex help

2005-03-03 Thread Steven Bethard
Kent Johnson wrote:
> for line in raw_data:
>     if line.startswith('RelevantInfo1'):
>         info1 = raw_data.next().strip()
>     elif line.startswith('RelevantInfo2'):
>         info2 = raw_data.next().strip()
>     elif line.startswith('RelevantInfo3'):
>         info3 = raw_data.next().strip()
>         scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
>         info1 = info2 = info3 = None

Very pretty. =)  I have to say, I hadn't ever used iterators this way 
before, that is, calling their next method from within a for-loop.  I 
like it. =)

Thanks for opening my mind. ;)
STeVe


Re: Multiline regex help

2005-03-03 Thread Kent Johnson
Steven Bethard wrote:
> Kent Johnson wrote:
>> for line in raw_data:
>>     if line.startswith('RelevantInfo1'):
>>         info1 = raw_data.next().strip()
>>     elif line.startswith('RelevantInfo2'):
>>         info2 = raw_data.next().strip()
>>     elif line.startswith('RelevantInfo3'):
>>         info3 = raw_data.next().strip()
>>         scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
>>         info1 = info2 = info3 = None
>
> Very pretty. =)  I have to say, I hadn't ever used iterators this way
> before, that is, calling their next method from within a for-loop.  I
> like it. =)

I confess I have a nagging suspicion that someone who actually knows something about CPython
internals will tell me why it's a bad idea...but it sure is handy!

> Thanks for opening my mind. ;)

My pleasure :-)
Kent
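
On the "is it a bad idea" question, a small sketch of why the trick works at the language level: a for loop calls iter() on its target, and for an object that is already an iterator (a file, a StringIO, a generator) iter() returns that same object, so next() inside the loop body advances the stream the loop is reading. A list is not its own iterator, so it needs an explicit iter() first. The function name `values_after` and the sample list are made up for illustration.

```python
def values_after(lines, marker):
    """Return the stripped line that follows each line starting with `marker`."""
    # For a list this creates a fresh iterator; for a file or generator it
    # returns the object itself. Either way the loop and next() share `it`.
    it = iter(lines)
    found = []
    for line in it:
        if line.startswith(marker):
            # A default of None avoids StopIteration when the marker is the
            # last line and no value follows it.
            value = next(it, None)
            if value is not None:
                found.append(value.strip())
    return found

data = ["junk", "RelevantInfo1", "10/10/04", "junk",
        "RelevantInfo1", "10/11/04", "RelevantInfo1"]
result = values_after(data, "RelevantInfo1")
print(result)

# An iterator is its own iterator, which is what makes the sharing work.
gen = (x for x in data)
print(iter(gen) is gen)
```

The main thing to watch for is the end of input: without a default, next() raises StopIteration when a marker is the final line.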