Re: C API String Parsing/Returning
When I run the following function, I see a memory leak: about 20 MB is allocated and never freed. Here is the code I run:

>>> import esauth
>>> for i in range(100):
...     ss = esauth.penc('sumer')
...
>>> for i in range(100):
...     ss = esauth.penc('sumer')
...

And here is the penc() function:

static PyObject *
penc(PyObject *self, PyObject *args)
{
    unsigned char *s = NULL;
    unsigned char *buf = NULL;
    PyObject *result = NULL;
    unsigned int v, len, i = 0;

    if (!PyArg_ParseTuple(args, "s#", &s, &len))
        return NULL;

    buf = strdup(s);
    if (!buf) {
        PyErr_SetString(PyExc_MemoryError, "Out of memory: strdup failed");
        return NULL;
    }

    /* string manipulation */

    result = PyString_FromString(buf);
    free(buf);
    return result;
}

Am I doing something wrong?

Thanks,
--
http://mail.python.org/mailman/listinfo/python-list
Re: C API String Parsing/Returning
Gerhard Häring wrote:
> char* buf = strdup(s);
> if (!buf) {
>     PyErr_SetString(PyExc_MemoryError, "Out of memory: strdup failed");
>     return NULL;
> }
>
> /* TODO: your string manipulation */

Don't forget to free(buf). ;)

Christian
--
http://mail.python.org/mailman/listinfo/python-list
Re: C API String Parsing/Returning
k3xji wrote:
> Hi all,
>
> This might be a newbie question. I am trying to implement a simple
> string decoder/encoder algorithm. Just suppose I am subtracting some
> values from the string passed as a parameter to the function, and I
> want the function to return the encoded/decoded version of the string.
>
> Here is the call:
> ss = esauth.penc('s')
> st = esauth.pdec(ss)
>
> static PyObject *
> pdec(PyObject *self, PyObject *args)
> {
>     unsigned char *s = NULL;
>     unsigned int v, len, i = 0;
>
>     if (!PyArg_ParseTuple(args, "s", &s))
>         return NULL;
>     if (!s)
>         return NULL;

These two lines are superfluous. s now points to the contents of the Python string (which must not contain any 0 characters, else a TypeError is raised instead). Python strings are immutable, so you should *not modify this C string*.

>     len = strlen(s);
>
>     for (i = 0; i < len; i++) {
>         if (s[i] > 10)
>             s[i] = s[i] - 10;
>     }
>
>     return Py_BuildValue("s", s);
> }
>
> This is returning the original string. I mean the parameter is changed
> but the Py_BuildValue is returning the original string passed in as
> param. [...]

Yes, that's because you're returning a Python string built from the string passed in ;-) You should do something else instead:

char *buf = strdup(s);
if (!buf) {
    PyErr_SetString(PyExc_MemoryError, "Out of memory: strdup failed");
    return NULL;
}

/* TODO: your string manipulation */

return PyString_FromString(buf); /* return Py_BuildValue("s", buf); */

If you want to cope with Python strings that may contain 0 bytes, parse them with "s#" instead. This should normally be better because you avoid the strlen() this way.

HTH

-- Gerhard
--
http://mail.python.org/mailman/listinfo/python-list
Re: C API String Parsing/Returning
Sorry, here is the correct output:

>>> ss = esauth.penc('s')
>>> print ss
╣
>>> esauth.pdec(ss)
'\xb9'
>>> print ss
s        --> Works fine!!!
>>> ss = esauth.penc('s')
>>> print ss
s
>>> ss = esauth.pdec(ss)
>>> print ss
╣        --> How did this happen if the param and return values are the same?

I cannot understand this. It must have something to do with ref counts, but I don't understand the problem.

On Apr 6, 3:13 pm, k3xji wrote:
> Hi all,
>
> This might be a newbie question. I am trying to implement a simple
> string decoder/encoder algorithm. Just suppose I am subtracting some
> values from the string passed as a parameter to the function, and I
> want the function to return the encoded/decoded version of the string.
>
> Here is the call:
> ss = esauth.penc('s')
> st = esauth.pdec(ss)
>
> static PyObject *
> pdec(PyObject *self, PyObject *args)
> {
>     unsigned char *s = NULL;
>     unsigned int v, len, i = 0;
>
>     if (!PyArg_ParseTuple(args, "s", &s))
>         return NULL;
>     if (!s)
>         return NULL;
>
>     len = strlen(s);
>
>     for (i = 0; i < len; i++) {
>         if (s[i] > 10)
>             s[i] = s[i] - 10;
>     }
>
>     return Py_BuildValue("s", s);
> }
>
> This is returning the original string. I mean the parameter is changed
> but the Py_BuildValue is returning the original string passed in as
> param.
>
> I have dealt with another more complex extension and, because of the
> same string-handling problems, I just stopped implementing it. Can
> somebody please briefly explain the gotchas in Python's string
> handling and returning of values, because I am having real trouble with
> them.
>
> Thanks,
--
http://mail.python.org/mailman/listinfo/python-list
C API String Parsing/Returning
Hi all,

This might be a newbie question. I am trying to implement a simple string decoder/encoder algorithm. Just suppose I am subtracting some values from the string passed as a parameter to the function, and I want the function to return the encoded/decoded version of the string.

Here is the call:

ss = esauth.penc('s')
st = esauth.pdec(ss)

static PyObject *
pdec(PyObject *self, PyObject *args)
{
    unsigned char *s = NULL;
    unsigned int v, len, i = 0;

    if (!PyArg_ParseTuple(args, "s", &s))
        return NULL;
    if (!s)
        return NULL;

    len = strlen(s);

    for (i = 0; i < len; i++) {
        if (s[i] > 10)
            s[i] = s[i] - 10;
    }

    return Py_BuildValue("s", s);
}

This is returning the original string. I mean the parameter is changed but the Py_BuildValue is returning the original string passed in as param.

I have dealt with another more complex extension and, because of the same string-handling problems, I just stopped implementing it. Can somebody please briefly explain the gotchas in Python's string handling and returning of values, because I am having real trouble with them.

Thanks,
--
http://mail.python.org/mailman/listinfo/python-list
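[Editorial note: a pure-Python sketch of the round trip the poster seems to intend, assuming a hypothetical shift-by-10 scheme inferred from the C code (the exact offsets are not confirmed by the post). It illustrates why the C extension must build a new buffer: Python strings are immutable, so the result has to be a new object.]

```python
def penc(s):
    # Build a NEW string; this mirrors what the C code should do by
    # copying the buffer (strdup) before modifying it.
    return ''.join(chr(ord(c) + 10) for c in s)

def pdec(s):
    # Inverse of penc for the values penc produces.
    return ''.join(chr(ord(c) - 10) if ord(c) > 10 else c for c in s)

ss = penc('sumer')
assert pdec(ss) == 'sumer'
```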
Re: Elementary string-parsing
Dennis Lee Bieber wrote:
> On Tue, 05 Feb 2008 04:03:04 GMT, Odysseus
> <[EMAIL PROTECTED]> declaimed the following in
> comp.lang.python:
>
>> Sorry, translation problem: I am acquainted with Python's "for" -- if
>> far from fluent with it, so to speak -- but the PS operator that's most
>> similar (traversing a compound object, element by element, without any
>> explicit indexing or counting) is called "forall". PS's "for" loop is
>> similar to BASIC's (and ISTR Fortran's):
>>
>> start_value increment end_value {procedure} for
>>
>> I don't know the proper generic term -- "indexed loop"? -- but at any
>> rate it provides a counter, unlike Python's command of the same name.
>
> The convention in Python is to use range() (or xrange()) to
> generate a sequence of "index" values for the for statement to loop
> over:
>
> for i in range([start], end, [step]):
>
> with the caveat that "end" will not be one of the values, and start defaults
> to 0, so if you supply range(4) the values become 0, 1, 2, 3 [i.e., 4
> values starting at 0].

If you have a sequence of values s and you want to associate each value with its index as you loop over the sequence, the easiest way to do this is the enumerate built-in function:

>>> for x in enumerate(['this', 'is', 'a', 'list']):
...     print x
...
(0, 'this')
(1, 'is')
(2, 'a')
(3, 'list')

It's usually (though not always) much more convenient to bind the index and the value to separate names, as in

>>> for i, v in enumerate(['this', 'is', 'a', 'list']):
...     print i, v
...
0 this
1 is
2 a
3 list

[...]

> The whole idea behind the SGML parser is that YOU add methods to
> handle each tag type you need... Also, FYI, there IS an HTML parser (in
> module htmllib) that is already derived from sgmllib.
> class PageParser(SGMLParser):
>     def __init__(self):
>         # need to call the parent __init__, and then
>         # initialize any needed attributes -- like someplace to collect
>         # the parsed-out cell data
>         self.row = {}
>         self.all_data = []
>
>     def start_table(self, attrs):
>         self.inTable = True
>         ...
>
>     def end_table(self):
>         self.inTable = False
>         ...
>
>     def start_tr(self, attrs):
>         if self.inRow:
>             # unclosed row!
>             self.end_tr()
>         self.inRow = True
>         self.cellCount = 0
>         ...
>
>     def end_tr(self):
>         self.inRow = False
>         # add/append collected row data to master stuff
>         self.all_data.append(self.row)
>         ...
>
>     def start_td(self, attrs):
>         if self.inCell:
>             self.end_td()
>         self.inCell = True
>         ...
>
>     def end_td(self):
>         self.cellCount = self.cellCount + 1
>         ...
>
>     def handle_data(self, text):
>         if self.inTable and self.inRow and self.inCell:
>             if self.cellCount == 0:
>                 # first column stuff
>                 self.row["Epoch1"] = convert_if_needed(text)
>             elif self.cellCount == 1:
>                 # second column stuff
>                 ...
>
> Hope you don't have nested tables -- it could get ugly, as this style
> of parser requires the start_tag()/end_tag() methods to set instance
> attributes for the purpose of tracking state needed in later methods
> (notice the complexity of the handle_data() method just to ensure that
> the text is from a table cell, and not some random text).

There is, of course, nothing to stop you building a recursive data structure, so that encountering a new opening tag such as <table> adds another level to some stack-like object, and the corresponding closing tag pops it off again, but this *does* add to the complexity somewhat. It seems natural that more complex input possibilities lead to more complex parsers.

> And somewhere before you close the parser, get a handle on the
> collected data...
>
> parsed_data = parser.all_data
> parser.close()
> return parsed_data

>> Why wouldn't one use a dictionary for that?
>
> The overhead may not be needed...
> Tuples can also be used as the keys /in/ a dictionary.

regards
Steve
--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC      http://www.holdenweb.com/
--
http://mail.python.org/mailman/listinfo/python-list
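[Editorial note: the closing remark about tuples as dictionary keys can be checked directly. A small sketch (example keys are illustrative, not from the thread): tuples are hashable and work as keys, while lists raise TypeError.]

```python
scores = {}
scores[('Odysseus', 2008)] = 42.0      # tuple key: fine
assert scores[('Odysseus', 2008)] == 42.0

try:
    scores[['Odysseus', 2008]] = 1.0   # list key: unhashable
except TypeError:
    pass                               # expected: lists cannot be dict keys
```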
Re: Elementary string-parsing
Marc 'BlackJack' Rintsch wrote:
> On Tue, 05 Feb 2008 06:19:12 +0000, Odysseus wrote:
>
>> In article <[EMAIL PROTECTED]>,
>> Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
>>
>>> Another issue is testing. If you rely on global names it's harder to test
>>> individual functions. [...]
>>>
>>> In programs without such global names you see quite clearly in the
>>> ``def`` line what the function expects as input.
>>
>> Good points, although thorough commenting can go a long way to help on
>> both counts. In theory, at least ...
>
> Won't work in practice so well. Say we have function `f()` and
> document that it expects global name `a` to be set to something before
> calling it. `f()` is used by other functions, so we have to document `a` in
> all other functions too. If we change `f()` to rely on global name `b`
> too, we have to hunt down every function that calls `f()` and add the
> documentation for `b` there too. It's much work and error-prone. It's easy
> to get inconsistent or missing documentation this way.

Essentially what Marc is saying is that you want your functions to be as loosely coupled to their environment as practically possible. See http://en.wikipedia.org/wiki/Coupling_(computer_science)

[...]

regards
Steve
--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC      http://www.holdenweb.com/
--
http://mail.python.org/mailman/listinfo/python-list
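[Editorial note: a tiny sketch of the coupling point, with illustrative names not taken from the thread. The loosely coupled version declares its inputs in the ``def`` line, so it can be tested in isolation.]

```python
na = 2  # module-level state

def extract_tight(cells):
    # tightly coupled: silently depends on the global `na`
    return cells[na:]

def extract_loose(cells, na):
    # loosely coupled: every input arrives as an argument
    return cells[na:]

assert extract_loose(['hdr1', 'hdr2', 'a', 'b'], 2) == ['a', 'b']
```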
Re: Elementary string-parsing
On Tue, 05 Feb 2008 06:19:12 +0000, Odysseus wrote:

> In article <[EMAIL PROTECTED]>,
> Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
>
>> Another issue is testing. If you rely on global names it's harder to test
>> individual functions. [...]
>>
>> In programs without such global names you see quite clearly in the
>> ``def`` line what the function expects as input.
>
> Good points, although thorough commenting can go a long way to help on
> both counts. In theory, at least ...

Won't work in practice so well. Say we have function `f()` and document that it expects global name `a` to be set to something before calling it. `f()` is used by other functions, so we have to document `a` in all other functions too. If we change `f()` to rely on global name `b` too, we have to hunt down every function that calls `f()` and add the documentation for `b` there too. It's much work and error-prone. It's easy to get inconsistent or missing documentation this way.

To write or check documentation for a function, you have to scan the whole function body for data in global names and calls to other functions, and repeat the search there. If you don't let functions communicate via global names, you just have to look at the argument list to see the input sources.

>> def main():
>>     # Main program comes here.
>>
>> if __name__ == '__main__':
>>     main()
>>
>> Then main is called when the script is called as program, but not called if
>> you just import the script as module. For example to test functions or to
>> reuse the code from other scripts.
>
> I'm using "if __name__ == 'main'" now, but only for test inputs (which
> will eventually be read from a config file or passed by the calling
> script -- or something). I hadn't thought of putting code that actually
> does something there. As for writing modules, that's way beyond where I
> want to go at this point: I don't know any C and am not sure I would
> want to ...

What does this have to do with C!? There's no specific C knowledge involved here.
>> assert name.startswith('Name: ')
>>
>> It checks if `name` really starts with 'Name: '. This way I turned the
>> comment into code that checks the assertion in the comment.
>
> Good idea to check, although this is actually only one of many
> assumptions I make about the data -- but what happens if the assertion
> fails? The program stops and the interpreter reports an AssertionError
> on line whatever?

Yes, you get an `AssertionError`:

In [314]: assert True

In [315]: assert False
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/home/bj/<ipython console> in <module>()

AssertionError:

Ciao,
        Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list
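[Editorial note: when an input assumption like the one above matters at runtime (``assert`` statements are stripped under ``python -O``), an explicit exception gives a clearer error. A hypothetical sketch -- `parse_name` is an illustrative helper, not from the thread:]

```python
def parse_name(line):
    # Assumed input shape: "Name: <something>".
    # Raise a descriptive error instead of relying on assert.
    if not line.startswith('Name: '):
        raise ValueError('unexpected line: %r' % line)
    return line[len('Name: '):]

assert parse_name('Name: Odysseus') == 'Odysseus'
```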
Re: Elementary string-parsing
In article <[EMAIL PROTECTED]>, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:

> The term "global" usually means "module global" in Python.

Because they're like the objects obtained from "import"?

> [T]he functions depend on some magic data coming from "nowhere" and
> it's much harder to follow the data flow in a program. If you work
> with globals you can't be sure what the following will print:
>
> def spam():
>     global x
>     x = 42
>     beep()
>     print x
>
> `beep()` might change `x`, or any function called by `beep()`, and so on.

I think I get the general point, but couldn't "beep()" get at "x" even without the "global" statement, since they're both in "spam()"? It seems natural to me to give the most important objects in a program persistent names: I guess this is something of a 'security blanket' I need to wean myself from. I can appreciate the benefits of context-independence when it comes to reusing code.

> Another issue is testing. If you rely on global names it's harder to test
> individual functions. [...]
>
> In programs without such global names you see quite clearly in the
> ``def`` line what the function expects as input.

Good points, although thorough commenting can go a long way to help on both counts. In theory, at least ...

> It's easy to "enforce" if you have minimal code on the module level. The
> usual idiom is:
>
> def main():
>     # Main program comes here.
>
> if __name__ == '__main__':
>     main()
>
> Then main is called when the script is called as program, but not called if
> you just import the script as module. For example to test functions or to
> reuse the code from other scripts.

I'm using "if __name__ == 'main'" now, but only for test inputs (which will eventually be read from a config file or passed by the calling script -- or something). I hadn't thought of putting code that actually does something there. As for writing modules, that's way beyond where I want to go at this point: I don't know any C and am not sure I would want to ...
[consolidating]

In article <[EMAIL PROTECTED]>, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:

> Then you can either pass in `found` as argument instead of creating it
> here, or you collect the passes in the calling code with the `update()`
> method of `dict`. Something like this:
>
> found = dict()
> for pass in passes:
>     # ...
>     found.update(extract_data(names, na, cells))

Cool. I'll have to read more about dictionary methods.

>> assert name.startswith('Name: ')
>
> It checks if `name` really starts with 'Name: '. This way I turned the
> comment into code that checks the assertion in the comment.

Good idea to check, although this is actually only one of many assumptions I make about the data -- but what happens if the assertion fails? The program stops and the interpreter reports an AssertionError on line whatever?

> [I]f you can make the source simpler and easier to understand by
> using the `index()` method, use a list. :-)

Understood; thanks for all the tips.

--
Odysseus
--
http://mail.python.org/mailman/listinfo/python-list
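[Editorial note: a runnable version of the `update()` idiom quoted above. Note that `pass` is a Python keyword, so the loop variable is renamed here; `extract_data` is a deliberately simplified stand-in (one cell per name), not the poster's real function.]

```python
def extract_data(names, na, cells):
    # simplified stand-in: map each name to the cell `na` places in
    return dict((name, cells[na + i]) for i, name in enumerate(names))

found = {}
passes = [(['alpha', 'beta'], 1, ['header', 10, 20]),
          (['gamma'], 0, [30])]
for names, na, cells in passes:
    # merge each pass's results into the running dictionary
    found.update(extract_data(names, na, cells))

assert found == {'alpha': 10, 'beta': 20, 'gamma': 30}
```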
Re: Elementary string-parsing
In article <[EMAIL PROTECTED]>, Dennis Lee Bieber <[EMAIL PROTECTED]> wrote:

> On Mon, 04 Feb 2008 09:43:04 GMT, Odysseus
> <[EMAIL PROTECTED]> declaimed the following in
> comp.lang.python:
>
>> Thanks, that will be very useful. I was casting about for a replacement
>> for PostScript's "for" loop, and the "while" loop (which PS lacks -- and
>> which I've never missed there) was all I could come up with.
>
> Have you read the language reference manual yet? It is a rather
> short document given that the language syntactic elements are not that
> complex -- but would have exposed you to the "for" statement (along with
> "return" and passing arguments).

Sorry, translation problem: I am acquainted with Python's "for" -- if far from fluent with it, so to speak -- but the PS operator that's most similar (traversing a compound object, element by element, without any explicit indexing or counting) is called "forall". PS's "for" loop is similar to BASIC's (and ISTR Fortran's):

start_value increment end_value {procedure} for

I don't know the proper generic term -- "indexed loop"? -- but at any rate it provides a counter, unlike Python's command of the same name.

> If your only other programming experience is base PostScript you
> wouldn't really be familiar with passing arguments or returning
> values -- as an RPN stack-based language, argument passing is just
> listing the arguments before a function call (putting a copy of them
> on the stack), and returns are whatever the function left on the
> stack at the end; hence they appear sort of global.

Working directly in the operand stack is efficient, but can make interpretation by humans -- and debugging -- very difficult. So for the sake of coder-friendliness it's generally advisable to use variables (i.e. assign values to keys in a dictionary) in most cases instead of passing values 'silently' via the stack. I'm beginning to realize that for Python the situation is just about the opposite ...
Anyway, I have been reading the documentation on the website, but much of the terminology is unfamiliar to me. When looking things up I seem to get an inordinate number of 404 errors from links returned by the search function, and often the language-reference or tutorial entries (if any) are buried several pages down. In general I'm finding the docs rather frustrating to navigate.

> After the language reference manual, the library reference manual
> chapter on built-ins and data types would be next for study -- the rest
> can usually be handled via search functions (working with time
> conversions, look for modules with date or time).

As I mentioned elsethread, I did look at the "time" documentation; it was there that I found a reference to the "calendar.timegm" function I used in my first attempt.

> It looked a bit like you were using a SAX-style parser to collect
> "names" and "cells" -- and then passing the "bunch" to another function
> to trim out and convert data... It would take me a bit to restudy the
> SAX parsing scheme (I did it once, back in the days of v1.5 or so) but
> the way I'd /try/ to do it is to have the stream handler keep track of
> which cell (<td> tag) is currently being parsed, and convert the string
> data at that level. You'd initialize the record dictionary to {} (and
> cell position to 0) on the <tr> tag, and return the populated record on
> the </tr> tag.

This is what my setup looks like -- mostly cribbed from _Dive Into Python_ -- where "PageParser" is a class based on "SGMLParser":

from sgmllib import SGMLParser
from urllib import urlopen

# ...

def parse_page(url):
    usock = urlopen(url)
    parser = PageParser()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    return parser

# ...

captured = parse_page(base_url + suffix)

I only use "parse_page" the once at this stage, but my plan was to call it repeatedly while varying "suffix" (depending on the data found by the previous pass).
On each pass the class will initialize itself, which is why I was collecting the data into a 'standing' (global) dictionary. Are you suggesting essentially that I'd do better to make the text-parsing function into a method of "PageParser"? Can one add, to such a derived class, methods that don't have prototypes in the parent?

> Might want to check into making a class/instance of the parser so
> you can make the record dictionary and column (cell) position instance
> attributes (avoiding globals).

AFAICT my "captured" is an instance of "PageParser", but I'm unclear on how I would add attributes to it -- and as things stand it will get rebuilt from scratch each time a page is read in.

>> [...] I'm somewhat intimidated by the whole concept of
>> exception-handling (among others). How do you know to expect a
>> "ValueError" if the string isn't a representation of a number?
>
> Read the library reference for the function in question? Though it
> appears the reference
Re: Elementary string-parsing
On Mon, 04 Feb 2008 09:43:04 +0000, Odysseus wrote:

> In article <[EMAIL PROTECTED]>,
> Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
>
>> def extract_data(names, na, cells):
>>     found = dict()
>
> The problem with initializing the 'super-dictionary' within this
> function is that I want to be able to add to it in further passes, with
> a new set of "names" & "cells" each time.

Then you can either pass in `found` as argument instead of creating it here, or you collect the passes in the calling code with the `update()` method of `dict`. Something like this:

found = dict()
for pass in passes:
    # ...
    found.update(extract_data(names, na, cells))

> BTW what's the difference between the above and "found = {}"?

I find it more "explicit". ``dict`` and ``list`` are easier to distinguish than ``{}`` and ``[]`` after a long coding session or when printed/displayed in a small font. It's just a matter of taste.

>> for i, name in enumerate(names):
>>     data = dict()
>>     cells_index = 10 * i + na
>>     for cell_name, index, parse in (('epoch1', 0, parse_date),
>>                                     ('epoch2', 1, parse_date),
>>                                     ('time', 5, parse_number),
>>                                     ('score1', 6, parse_number),
>>                                     ('score2', 7, parse_number)):
>>         data[cell_name] = parse(cells[cells_index + index])
>
> This looks a lot more efficient than my version, but what about the
> strings that don't need parsing? Would it be better to define a
> 'pass-through' function that just returns its input, so they can be
> handled by the same loop, or to handle them separately with another loop?

I'd handle them in the same loop. A "pass-through" function for strings already exists:

In [255]: str('hello')
Out[255]: 'hello'

>> assert name.startswith('Name: ')
>
> I looked up "assert", but all I could find relates to debugging. Not
> that I think debugging is something I can do without ;) but I don't
> understand what this line does.

It checks if `name` really starts with 'Name: '. This way I turned the comment into code that checks the assertion in the comment.
>> The `parse_number()` function could look like this:
>>
>> def parse_number(string):
>>     try:
>>         return float(string.replace(',', ''))
>>     except ValueError:
>>         return string
>>
>> Indeed the commas can be replaced a bit more elegantly. :-)
>
> Nice, but I'm somewhat intimidated by the whole concept of
> exception-handling (among others). How do you know to expect a
> "ValueError" if the string isn't a representation of a number?

Experience. I just tried what happens if I feed `float()` a string that is no number:

In [256]: float('abc')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/bj/<ipython console> in <module>()

ValueError: invalid literal for float(): abc

> Is there a list of common exceptions somewhere? (Searching for
> "ValueError" turned up hundreds of passing mentions, but I couldn't find
> a definition or explanation.)

The definition is quite vague: the type of an argument is correct, but there's something wrong with the value. See http://docs.python.org/lib/module-exceptions.html for an overview of the built-in exceptions.

>> As already said, that ``while`` loop should be a ``for`` loop. But if
>> you put `m_abbrevs` into a `list` you can replace the loop with a
>> single call to its `index()` method:
>> ``dlist[1] = m_abbrevs.index(dlist[1]) + 1``.
>
> I had gathered that lists shouldn't be used for storing constants. Is
> that more of a suggestion than a rule?

Some suggest this. Others say tuples are for data where the position of an element has a "meaning", and lists are for elements that all have the same "meaning", for some definition of meaning. As an example, ('John', 'Doe', 'Dr.') vs. ['Peter', 'Paul', 'Mary']: in the first example we have name, surname, title, and in the second example all elements are just names -- unless the second example models a relation like child, father, mother, or something like that. Anyway, if you can make the source simpler and easier to understand by using the `index()` method, use a list.
:-)

Ciao,
        Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list
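[Editorial note: Marc's `parse_number()` sketch can be exercised directly; the sample cell values here are illustrative, matching the "---"/"n/a" placeholders mentioned earlier in the thread.]

```python
def parse_number(s):
    # Strip thousands separators; fall back to the original
    # string when the cell isn't numeric ("---", "n/a", ...).
    try:
        return float(s.replace(',', ''))
    except ValueError:
        return s

cells = ['12,345', 'n/a', '---', '6.5']
values = [parse_number(c) for c in cells]
assert values == [12345.0, 'n/a', '---', 6.5]
```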
Re: Elementary string-parsing
On Mon, 04 Feb 2008 12:25:24 +0000, Odysseus wrote:

> I'm not clear on what makes an object global, other than appearing as an
> operand of a "global" statement, which I don't use anywhere. But "na" is
> assigned its value in the program body, not within any function: does
> that make it global?

Yes. The term "global" usually means "module global" in Python.

> Why is this not recommended?

Because the functions depend on some magic data coming from "nowhere" and it's much harder to follow the data flow in a program. If you work with globals you can't be sure what the following will print:

def spam():
    global x
    x = 42
    beep()
    print x

`beep()` might change `x`, or any function called by `beep()`, and so on.

Another issue is testing. If you rely on global names it's harder to test individual functions. If I want to test your `extract_data()` I first have to look through the whole function body, search out all the global references, and bind those names to values before I can call the function. This might not be enough: any function called by `extract_data()` might need some global assignments too. This way you'll quite soon reach a point where the individual parts of a program can't be tested in isolation and are not reusable for other programs.

In programs without such global names you see quite clearly in the ``def`` line what the function expects as input.

> If I wrap the assignment in a function, making "na" a local variable, how
> can "extract_data" then access it?

Give it as an argument. As a rule of thumb, values should enter a function as arguments and leave it as return values. It's easy to "enforce" this if you have minimal code on the module level. The usual idiom is:

def main():
    # Main program comes here.

if __name__ == '__main__':
    main()

Then main is called when the script is called as program, but not called if you just import the script as module. For example to test functions or to reuse the code from other scripts.
>> def extract_data(names, na, cells):
>>
>> and
>>
>> return
>
> What should it return? A Boolean indicating success or failure? All the
> data I want should all have been stored in the "found" dictionary by the
> time the function finishes traversing the list of names.

Then create the `found` dictionary in that function and return it at the end.

Ciao,
        Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list
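[Editorial note: putting Marc's two suggestions together -- build the dictionary locally and return it, and keep module-level code behind the __main__ guard. The simplified body of `extract_data` (one cell per name) is hypothetical, not the poster's real logic.]

```python
def extract_data(names, na, cells):
    # build the dict locally and hand it back, instead of
    # filling in a module-level "found"
    found = {}
    for i, name in enumerate(names):
        found[name] = cells[na + i]   # simplified: one cell per name
    return found

def main():
    print(extract_data(['alpha', 'beta'], 1, ['header', 10, 20]))

if __name__ == '__main__':
    main()
```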
Re: Elementary string-parsing
In article <[EMAIL PROTECTED]>, Dennis Lee Bieber <[EMAIL PROTECTED]> wrote:

> Rather complicated description... A sample of the real/actual input
> /file/ would be useful.

Sorry, I didn't want to go on too long about the background, but I guess more context would have helped. The data actually come from a web page; I use a class based on SGMLParser to do the initial collection. The items in the "names" list were originally "title" attributes of anchor tags and are obtained with a "start_a" method, while "cells" holds the contents of the <td> tags, obtained by a "handle_data" method according to the state of a flag that's set to True by a "start_td" method and to False by an "end_td". I don't care about anything else on the page, so I didn't define most of the tag-specific methods available.

> cellRoot = 10 * i + na    # where did na come from?
>                           # heck, where do names and cells come from?
>                           # Globals? Not recommended...

The variable "na" is the number of 'not applicable' items (headings and whatnot) preceding the data I'm interested in. I'm not clear on what makes an object global, other than appearing as an operand of a "global" statement, which I don't use anywhere. But "na" is assigned its value in the program body, not within any function: does that make it global? Why is this not recommended? If I wrap the assignment in a function, making "na" a local variable, how can "extract_data" then access it?

The lists of data are attributes (?) of my SGMLParser class; in my misguided attempt to pare irrelevant details from "extract_data" I obfuscated this aspect. I have a "parse_page(url)" function that returns an instance of the class, as "captured", and the lists in question are actually called "captured.names" and "captured.cells". The "parse_page(url)" function is called in the program body; does that make its output global as well?

> use
>
> def extract_data(names, na, cells):
>
> and
>
> return

What should it return? A Boolean indicating success or failure?
All the data I want should all have been stored in the "found" dictionary by the time the function finishes traversing the list of names.

>> for k in ('time', 'score1', 'score2'):
>>     v = found[name][k]
>>     if v != "---" and v != "n/a":    # skip non-numeric data
>>         v = ''.join(v.split(","))    # remove commas between 000s
>>         found[name][k] = float(v)
>
> I'd suggest splitting this into a short function, and invoking it in
> the preceding... say it is called "parsed"
>
> "time" : parsed(cells[cellRoot + 5]),

Will do. I guess part of my problem is that, being unsure of myself, I'm reluctant to attempt too much in a single complex statement, finding it easier to take small and simple (but inefficient) steps. I'll have to learn to consolidate things as I go.

> Did you check the library for time/date parsing/formatting
> operations?
>
> >>> import time
> >>> aTime = "03 Feb 2008 20:35:46 UTC"    # DD Mth HH:MM:SS UTC
> >>> time.strptime(aTime, "%d %b %Y %H:%M:%S %Z")
> (2008, 2, 3, 20, 35, 46, 6, 34, 0)

I looked at the documentation for the "time" module, including "strptime", but I didn't realize the "%b" directive would match the month abbreviations I'm dealing with. It's described as "Locale's abbreviated month name"; if someone were to run my program on a French system, e.g., wouldn't it try to find a match among "jan", "fév", ..., "déc" (or whatever) and fail? Is there a way to declare a "locale" that will override the user's settings? Are the locale-specific strings documented anywhere? Can one assume them to be identical in all English-speaking countries, at least? Now it's pretty unlikely in this case that such an 'international situation' will arise, but I didn't want to burn any bridges ...

I was also somewhat put off "strptime" on reading the caveat: "Note: This function relies entirely on the underlying platform's C library for the date parsing, and some of these libraries are buggy.
There's nothing to be done about this short of a new, portable implementation of strptime()." If it works, however, it'll be a lot tidier than what I was doing. I'll make a point of testing it on its own, with a variety of inputs.

> Note that the %Z is a problematic entry...
> ValueError: time data did not match format: data=03 Feb 2008
> 20:35:46 PST fmt=%d %b %Y %H:%M:%S %Z

All the times are UTC, so fortunately this is a non-issue for my purposes of the moment. May I assume that leaving the zone out will cause the time to be treated as UTC?

Thanks for your help, and for bearing with my elementary questions and my fumbling about.

--
Odysseus
--
http://mail.python.org/mailman/listinfo/python-list
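[Editorial note: one way around the %b locale worry raised above is to map the English month abbreviations by hand and use calendar.timegm, which always interprets its tuple as UTC. A sketch under that assumption -- `parse_utc` and `MONTHS` are illustrative names, not from the thread:]

```python
import calendar

# English abbreviations, independent of the user's locale
MONTHS = {m: i + 1 for i, m in enumerate(
    'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec'.split())}

def parse_utc(stamp):
    """Parse 'DD Mth YYYY HH:MM:SS UTC' into seconds since the epoch."""
    day, mon, year, hms, zone = stamp.split()
    hour, minute, sec = (int(p) for p in hms.split(':'))
    # calendar.timegm treats the tuple as UTC (unlike time.mktime)
    return calendar.timegm(
        (int(year), MONTHS[mon], int(day), hour, minute, sec, 0, 0, 0))

parse_utc('03 Feb 2008 20:35:46 UTC')
```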
Re: Elementary string-parsing
On Feb 4, 8:43 pm, Odysseus <[EMAIL PROTECTED]> wrote:
> In article <[EMAIL PROTECTED]>,
>  Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
> >     found = dict()
> BTW what's the difference between the above and "found = {}"?

{} takes 4 fewer keystrokes, doesn't have the overhead of a function
call, and works with Pythons at least as far back as 1.5.2 -- apart
from that, it's got absolutely nothing going for it ;-)
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: Elementary string-parsing
On Feb 4, 3:21 am, Odysseus <[EMAIL PROTECTED]> wrote:
> The next one is much messier. A couple of the strings represent times,
> which I think will be most useful in 'native' form, but the input is
> in the format "DD Mth YYYY HH:MM:SS UTC".

time.strptime will do this! You can find the documentation at
http://docs.python.org/lib/module-time.html

Untested: time.strptime(my_date, '%d %b %Y %H:%M:%S %Z')

--
Paul Hankin
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: Elementary string-parsing
In article <[EMAIL PROTECTED]>,
 Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:

> Here and in later code you use a ``while`` loop although it is known at
> loop start how many times the loop body will be executed.  That's a job
> for a ``for`` loop.  If possible not over an integer that is used later
> just as index into list, but the list itself.  Here you need both, index
> and objects from `names`.  There's the `enumerate()` function for creating
> an iterable of (index, name) from `names`.

Thanks, that will be very useful. I was casting about for a replacement
for PostScript's "for" loop, and the "while" loop (which PS lacks -- and
which I've never missed there) was all I could come up with.

> I'd put all the relevant information that describes a field of the
> dictionary that is put into `found` into tuples and loop over it.  There
> is the cell name, the index of the cell and a function that converts the
> string from that cell into an object that is stored in the dictionary.
> This leads to (untested):
>
> def extract_data(names, na, cells):
>     found = dict()

The problem with initializing the 'super-dictionary' within this
function is that I want to be able to add to it in further passes, with
a new set of "names" & "cells" each time.

BTW what's the difference between the above and "found = {}"?

>     for i, name in enumerate(names):
>         data = dict()
>         cells_index = 10 * i + na
>         for cell_name, index, parse in (('epoch1', 0, parse_date),
>                                         ('epoch2', 1, parse_date),
>                                         ('time', 5, parse_number),
>                                         ('score1', 6, parse_number),
>                                         ('score2', 7, parse_number)):
>             data[cell_name] = parse(cells[cells_index + index])

This looks a lot more efficient than my version, but what about the
strings that don't need parsing? Would it be better to define a
'pass-through' function that just returns its input, so they can be
handled by the same loop, or to handle them separately with another
loop?

>         assert name.startswith('Name: ')

I looked up "assert", but all I could find relates to debugging. Not
that I think debugging is something I can do without ;) but I don't
understand what this line does.

>         found[name[6:]] = data
>     return found
>
> The `parse_number()` function could look like this:
>
> def parse_number(string):
>     try:
>         return float(string.replace(',', ''))
>     except ValueError:
>         return string
>
> Indeed the commas can be replaced a bit more elegantly.  :-)

Nice, but I'm somewhat intimidated by the whole concept of
exception-handling (among others). How do you know to expect a
"ValueError" if the string isn't a representation of a number? Is there
a list of common exceptions somewhere? (Searching for "ValueError"
turned up hundreds of passing mentions, but I couldn't find a
definition or explanation.)

> As already said, that ``while`` loop should be a ``for`` loop.  But if you
> put `m_abbrevs` into a `list` you can replace the loop with a single call
> to its `index()` method: ``dlist[1] = m_abbrevs.index(dlist[1]) + 1``.

I had gathered that lists shouldn't be used for storing constants. Is
that more of a suggestion than a rule? I take it tuples don't have an
"index()" method.

Thanks for the detailed advice. I'll post back if I have any trouble
implementing your suggestions.

-- 
Odysseus
-- 
http://mail.python.org/mailman/listinfo/python-list
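To make the two questions above concrete, here is a small sketch reusing Marc's parse_number(): float() raises ValueError on text that isn't a number, and catching it lets the non-numeric markers pass through unchanged. An assert simply raises AssertionError if its condition is false and otherwise does nothing.

```python
def parse_number(string):
    # float() raises ValueError when the text isn't a number; catching
    # it lets markers like "---" and "n/a" pass through unchanged.
    try:
        return float(string.replace(',', ''))
    except ValueError:
        return string

print(parse_number("1,234,567"))  # -> 1234567.0
print(parse_number("---"))        # -> ---

# assert is a sanity check: it raises AssertionError if the condition
# is false, and does nothing otherwise.
name = "Name: Odysseus"
assert name.startswith('Name: ')
print(name[6:])  # -> Odysseus
```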
Re: Elementary string-parsing
On Mon, 04 Feb 2008 03:21:18 +0000, Odysseus wrote:

> def extract_data():
>     i = 0
>     while i < len(names):
>         name = names[i][6:]  # strip off "Name: "
>         found[name] = {'epoch1': cells[10 * i + na],
>                        'epoch2': cells[10 * i + na + 1],
>                        'time': cells[10 * i + na + 5],
>                        'score1': cells[10 * i + na + 6],
>                        'score2': cells[10 * i + na + 7]}

Here and in later code you use a ``while`` loop although it is known at
loop start how many times the loop body will be executed.  That's a job
for a ``for`` loop.  If possible not over an integer that is used later
just as index into list, but the list itself.  Here you need both, index
and objects from `names`.  There's the `enumerate()` function for
creating an iterable of (index, name) from `names`.

I'd put all the relevant information that describes a field of the
dictionary that is put into `found` into tuples and loop over it.  There
is the cell name, the index of the cell and a function that converts the
string from that cell into an object that is stored in the dictionary.
This leads to (untested):

def extract_data(names, na, cells):
    found = dict()
    for i, name in enumerate(names):
        data = dict()
        cells_index = 10 * i + na
        for cell_name, index, parse in (('epoch1', 0, parse_date),
                                        ('epoch2', 1, parse_date),
                                        ('time', 5, parse_number),
                                        ('score1', 6, parse_number),
                                        ('score2', 7, parse_number)):
            data[cell_name] = parse(cells[cells_index + index])
        assert name.startswith('Name: ')
        found[name[6:]] = data
    return found

The `parse_number()` function could look like this:

def parse_number(string):
    try:
        return float(string.replace(',', ''))
    except ValueError:
        return string

Indeed the commas can be replaced a bit more elegantly.  :-)

`parse_date()` is left as an exercise for the reader.
>         for k in ('epoch1', 'epoch2'):
>             dlist = found[name][k].split(" ")
>             m = 0
>             while m < 12:
>                 if m_abbrevs[m] == dlist[1]:
>                     dlist[1] = m + 1
>                     break
>                 m += 1
>             tlist = dlist[3].split(":")
>             found[name][k] = timegm((int(dlist[2]), int(dlist[1]),
>                                      int(dlist[0]), int(tlist[0]),
>                                      int(tlist[1]), int(tlist[2]),
>                                      -1, -1, 0))
>         i += 1
>
> The function appears to be working OK as is, but I would welcome any &
> all suggestions for improving it or making it more idiomatic.

As already said, that ``while`` loop should be a ``for`` loop.  But if
you put `m_abbrevs` into a `list` you can replace the loop with a
single call to its `index()` method:
``dlist[1] = m_abbrevs.index(dlist[1]) + 1``.

Ciao,
	Marc 'BlackJack' Rintsch
-- 
http://mail.python.org/mailman/listinfo/python-list
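A quick sketch of the suggested index() replacement for the month loop, on the sample date from upthread:

```python
m_abbrevs = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

dlist = "03 Feb 2008 20:35:46 UTC".split(" ")
# index() returns the position of the first match, so no manual counter
# loop is needed; +1 converts the 0-based position to a month number.
dlist[1] = m_abbrevs.index(dlist[1]) + 1
print(dlist[1])  # -> 2
```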
Elementary string-parsing
I'm writing my first 'real' program, i.e. one that has a purpose aside
from serving as a learning exercise. I'm posting to solicit comments
about my efforts at translating strings from an external source into
useful data, regarding both efficiency and 'pythonicity'. My only
significant programming experience is in PostScript, and I feel that I
haven't yet 'found my feet' concerning the object-oriented aspects of
Python, so I'd be especially interested to know where I may be
neglecting to take advantage of them.

My input is in the form of correlated lists of strings, which I want to
merge (while ignoring some extraneous items). I populate a dictionary
called "found" with these data, still in string form. It contains
sub-dictionaries of various items keyed to strings extracted from the
list "names"; these sub-dictionaries in turn contain the associated
items I want from "cells". After loading in the strings (I have omitted
the statements that pick up strings that require no further processing,
some of them coming from a third list), I convert selected items in
place. Here's the function I wrote:

def extract_data():
    i = 0
    while i < len(names):
        name = names[i][6:]  # strip off "Name: "
        found[name] = {'epoch1': cells[10 * i + na],
                       'epoch2': cells[10 * i + na + 1],
                       'time': cells[10 * i + na + 5],
                       'score1': cells[10 * i + na + 6],
                       'score2': cells[10 * i + na + 7]}

### Following is my first parsing step, for those data that represent
real numbers. The two obstacles I'm contending with here are that the
figures have commas grouping the digits in threes, and that sometimes
the data are non-numeric -- I'll deal with those later. Is there a more
elegant way of removing the commas than the split-and-rejoin below? ###

        for k in ('time', 'score1', 'score2'):
            v = found[name][k]
            if v != "---" and v != "n/a":  # skip non-numeric data
                v = ''.join(v.split(","))  # remove commas between 000s
                found[name][k] = float(v)

### The next one is much messier.
A couple of the strings represent times, which I think will be most
useful in 'native' form, but the input is in the format
"DD Mth YYYY HH:MM:SS UTC". Near the beginning of my program I have
"from calendar import timegm". Before I can feed the data to this
function, though, I have to convert the month abbreviation to a number.
I couldn't come up with anything more elegant than look-up from a list;
the relevant part of my initialization is

'''
m_abbrevs = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
'''

I'm also rather unhappy with the way I kluged the seventh and eighth
values in the tuple passed to timegm, the order of the date in the week
and in the year respectively. (I would hate to have to calculate them.)
The function doesn't seem to care what values I give it for these -- as
long as I don't omit them -- so I guess they're only there for the sake
of matching the output of the inverse function. Is there a version of
timegm that takes a tuple of only six (or seven) elements, or any
better way to handle this situation? ###

        for k in ('epoch1', 'epoch2'):
            dlist = found[name][k].split(" ")
            m = 0
            while m < 12:
                if m_abbrevs[m] == dlist[1]:
                    dlist[1] = m + 1
                    break
                m += 1
            tlist = dlist[3].split(":")
            found[name][k] = timegm((int(dlist[2]), int(dlist[1]),
                                     int(dlist[0]), int(tlist[0]),
                                     int(tlist[1]), int(tlist[2]),
                                     -1, -1, 0))
        i += 1

The function appears to be working OK as is, but I would welcome any &
all suggestions for improving it or making it more idiomatic.

-- 
Odysseus
-- 
http://mail.python.org/mailman/listinfo/python-list
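One locale-independent refinement of the above (a sketch, not from the thread): build a dict from the m_abbrevs tuple once, so the month lookup is a single O(1) operation that never depends on the user's locale, and wrap the whole conversion in a parse_date() helper. The weekday/yearday slots of the timegm() tuple really are ignored, so -1 placeholders are fine there.

```python
from calendar import timegm

m_abbrevs = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
# Map each abbreviation to its month number, independent of locale.
month_num = dict((abbrev, m + 1) for m, abbrev in enumerate(m_abbrevs))

def parse_date(text):
    # text looks like "DD Mth YYYY HH:MM:SS UTC"
    day, mth, year, hms, _zone = text.split(" ")
    hh, mm, ss = hms.split(":")
    # timegm() ignores the weekday/yearday/DST slots, so placeholders
    # are fine there.
    return timegm((int(year), month_num[mth], int(day),
                   int(hh), int(mm), int(ss), -1, -1, 0))

print(parse_date("03 Feb 2008 20:35:46 UTC"))  # -> 1202070946
```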
Re: string parsing / regexp question
Interesting. Thanks Paul and Tim. This looks very promising.

Ryan

On Nov 28, 2007 1:23 PM, Paul McGuire <[EMAIL PROTECTED]> wrote:
> As Tim Grove points out, writing a grammar for this expression is
> really pretty simple, especially using the latest version of
> pyparsing, which includes a new helper method, nestedExpr.
> [...]
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: string parsing / regexp question
Paul McGuire wrote: > On Nov 28, 1:23 pm, Paul McGuire <[EMAIL PROTECTED]> wrote: >> As Tim Grove points out, ... > > s/Grove/Chase/ > > Sorry, Tim! No problem...it's not like there aren't enough Tim's on the list as it is. :) -tkc -- http://mail.python.org/mailman/listinfo/python-list
Re: string parsing / regexp question
On Nov 28, 1:23 pm, Paul McGuire <[EMAIL PROTECTED]> wrote: > As Tim Grove points out, ... s/Grove/Chase/ Sorry, Tim! -- Paul -- http://mail.python.org/mailman/listinfo/python-list
Re: string parsing / regexp question
On Nov 28, 11:32 am, "Ryan Krauss" <[EMAIL PROTECTED]> wrote: > I need to parse the following string: > > $$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=\pmatrix{\left({{{\it m_2}\,s^2 > }\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F > }\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1 > \right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr }$$ > > The first thing I need to do is extract the arguments to \pmatrix{ } > on both the left and right hand sides of the equal sign, so that the > first argument is extracted as > > {\it x_2}\cr 0\cr 1\cr > > and the second is > > \left({{{\it m_2}\,s^2 > }\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F > }\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1 > \right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr > > The trick is that there are extra curly braces inside the \pmatrix{ } > strings and I don't know how to write a regexp that would count the > number of open and close curly braces and make sure they match, so > that it can find the correct ending curly brace. > As Tim Grove points out, writing a grammar for this expression is really pretty simple, especially using the latest version of pyparsing, which includes a new helper method, nestedExpr. 
Here is the whole program to parse your example:

from pyparsing import *

data = r"""$$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=
\pmatrix{\left({{{\it m_2}\,s^2
}\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F
}\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1
\right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr }$$"""

PMATRIX = Literal(r"\pmatrix")
nestedBraces = nestedExpr("{","}")
grammar = "$$" + PMATRIX + nestedBraces + "=" + \
          PMATRIX + nestedBraces + \
          "$$"
res = grammar.parseString(data)
print res

This prints the following:

['$$', '\\pmatrix', [['\\it', 'x_2'], '\\cr', '0\\cr', '1\\cr'], '=',
'\\pmatrix', ['\\left(', [[['\\it', 'm_2'], '\\,s^2'], '\\over',
['k']], '+1\\right)\\,', ['\\it', 'x_1'], '-', [['F'], '\\over',
['k']], '\\cr', '-', [[['\\it', 'm_2'], '\\,s^2\\,F'], '\\over',
['k']], '-F+\\left(', ['\\it', 'm_2'], '\\,s^2\\,\\left(', [[['\\it',
'm_2'], '\\,s^2'], '\\over', ['k']], '+1', '\\right)+', ['\\it',
'm_2'], '\\,s^2\\right)\\,', ['\\it', 'x_1'], '\\cr', '1\\cr'], '$$']

Okay, maybe this looks a bit messy.  But believe it or not, the
returned results give you access to each grammar element as:

['$$', '\\pmatrix', [nested arg list], '=', '\\pmatrix',
[nestedArgList], '$$']

Not only has the parser handled the {} nesting levels, but it has
structured the returned tokens according to that nesting.  (The '{}'s
are gone now, since their delimiting function has been replaced by the
nesting hierarchy in the results.)

You could use tuple assignment to get at the individual fields:

dummy,dummy,lhs_args,dummy,dummy,rhs_args,dummy = res

Or you could access the fields in res using list indexing:

lhs_args, rhs_args = res[2],res[5]

But both of these methods will break if you decide to extend the
grammar with additional or optional fields.

A safer approach is to give the grammar elements results names, as in
this slightly modified version of grammar:

grammar = "$$" + PMATRIX + nestedBraces("lhs_args") + "=" + \
          PMATRIX + nestedBraces("rhs_args") + \
          "$$"

Now you can access the parsed fields as if the results were a dict
with keys "lhs_args" and "rhs_args", or as an object with attributes
named "lhs_args" and "rhs_args":

res = grammar.parseString(data)
print res["lhs_args"]
print res["rhs_args"]
print res.lhs_args
print res.rhs_args

Note that the default behavior of nestedExpr is to give back a nested
list of the elements according to how the original text was nested
within braces.

If you just want the original text, add a parse action to nestedBraces
to do this for you (keepOriginalText is another pyparsing builtin).
The parse action is executed at parse time so that there is no
post-processing needed after the parsed results are returned:

nestedBraces.setParseAction(keepOriginalText)
grammar = "$$" + PMATRIX + nestedBraces("lhs_args") + "=" + \
          PMATRIX + nestedBraces("rhs_args") + \
          "$$"

res = grammar.parseString(data)
print res
print res.lhs_args
print res.rhs_args

Now this program returns the original text for the nested brace
expressions:

['$$', '\\pmatrix', '{{\\it x_2}\\cr 0\\cr 1\\cr }', '=', '\\pmatrix',
'{\\left({{{\\it m_2}\\,s^2 \n }\\over{k}}+1\\right)\\,{\\it x_1}-{{F}\
\over{k}}\\cr -{{{\\it m_2}\\,s^2\\,F \n }\\over{k}}-F+\\left({\\it
m_2}\\,s^2\\,\\left({{{\\it m_2}\\,s^2}\\over{k}}+1 \n
\\right)+{\\it m_2}\\,s^2\\right)\\,{\\it x_1}\\cr 1\\cr }', '$$']
['{{\\it x_2}\\cr 0\\cr 1\\cr }']
['{\\left({{{\\it m_2}\\,s^2 \n }\\over{k}}+1\\right)\\,{\\it x_1}-{{F}
\\over{k
Re: string parsing / regexp question
> The trick is that there are extra curly braces inside the \pmatrix{ } > strings and I don't know how to write a regexp that would count the > number of open and close curly braces and make sure they match, so > that it can find the correct ending curly brace. This criterion is pretty much a deal-breaker for using regexps, as you can't really nest things to arbitrary depths using regexps. You really do need a parser of sorts, and pyparsing[1] is one of the more popular parsers, and fairly easy to use. -tim [1] http://pyparsing.wikispaces.com/ -- http://mail.python.org/mailman/listinfo/python-list
string parsing / regexp question
I need to parse the following string: $$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=\pmatrix{\left({{{\it m_2}\,s^2 }\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F }\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1 \right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr }$$ The first thing I need to do is extract the arguments to \pmatrix{ } on both the left and right hand sides of the equal sign, so that the first argument is extracted as {\it x_2}\cr 0\cr 1\cr and the second is \left({{{\it m_2}\,s^2 }\over{k}}+1\right)\,{\it x_1}-{{F}\over{k}}\cr -{{{\it m_2}\,s^2\,F }\over{k}}-F+\left({\it m_2}\,s^2\,\left({{{\it m_2}\,s^2}\over{k}}+1 \right)+{\it m_2}\,s^2\right)\,{\it x_1}\cr 1\cr The trick is that there are extra curly braces inside the \pmatrix{ } strings and I don't know how to write a regexp that would count the number of open and close curly braces and make sure they match, so that it can find the correct ending curly brace. Any suggestions? I would prefer a regexp solution, but am open to other approaches. Thanks, Ryan -- http://mail.python.org/mailman/listinfo/python-list
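Although the thread settled on pyparsing, here is a hedged sketch of the regexp-free alternative hinted at in the question: a few lines of plain Python that search for each \pmatrix{ and count braces to find the matching close. The function name and the short demo string are made up for illustration.

```python
def pmatrix_args(text):
    """Return the brace-balanced argument of each \\pmatrix{...} in text."""
    marker = r"\pmatrix{"
    args = []
    pos = text.find(marker)
    while pos != -1:
        start = pos + len(marker)
        depth = 1  # we are just inside the opening brace
        i = start
        while i < len(text) and depth:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        args.append(text[start:i - 1])  # exclude the closing brace
        pos = text.find(marker, i)
    return args

# A shortened stand-in for the real expression:
demo = r"$$\pmatrix{{\it x_2}\cr 0\cr 1\cr }=\pmatrix{a{b{c}}d}$$"
print(pmatrix_args(demo))
```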
Re: String parsing
Dennis Lee Bieber wrote: > > I was trying to stay with a solution the should have been available > in the version of Python equivalent to the Jython being used by the > original poster. HTMLParser, according to the documents, was 2.2 level. I guess I should read the whole thread before posting. ;-) I'll have to look into libxml2 availability for Java, though, as it appears (from various accounts) that some Java platform users struggle with HTML parsing or have a really limited selection of decent and performant parsers in that area. Another thing for the "to do" list... Paul -- http://mail.python.org/mailman/listinfo/python-list
Re: String parsing
On 9 May, 06:42, Dennis Lee Bieber <[EMAIL PROTECTED]> wrote:
> [HTMLParser-based solution]

Here's another approach using libxml2dom [1] in HTML parsing mode:

import libxml2dom

# The text, courtesy of Dennis.
sample = """ """

# Parse the string in HTML mode.
d = libxml2dom.parseString(sample, html=1)

# For all input fields having the name 'LastUpdated',
# get the value attribute.
last_updated_fields = d.xpath("//input[@name='LastUpdated']/@value")

# Assuming we find one, print the contents of the value attribute.
print last_updated_fields[0].nodeValue

Paul

[1] http://www.python.org/pypi/libxml2dom
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: String parsing
BTW, here's what I used; the other ideas have been squirreled away in
my neat tricks and methods folder.

for el in data.splitlines():
    if el.find('LastUpdated') != -1:
        s = el.split("=")[-1].split('"')[1]
        print 's:', s

Thanks again,

jh
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: String parsing
> This looks to be simple HTML (and I'm presuming that's a typo on
> that ?> ending). A quick glance at the Python library reference (you
> do have a copy, don't you) reveals at least two HTML parsing
> modules...

No, that is not a typo, and it bears investigation. Thanks for the
find. I found HTMLParser but had trouble setting it up.

> About five minutes work gave me this:

My effort has been orders of magnitude greater in time.

Thanks all for all the excellent suggestions.

jh
-- 
http://mail.python.org/mailman/listinfo/python-list
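For the record, a minimal HTMLParser setup (a sketch for modern Python, where the module lives in html.parser; in the Python/Jython 2.x of this thread the import is `from HTMLParser import HTMLParser`). The one-line page fragment is made up, standing in for the sample that didn't survive in the archive:

```python
from html.parser import HTMLParser

class LastUpdatedFinder(HTMLParser):
    """Collect the value attribute of <input name="LastUpdated" ...>."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.value = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name") == "LastUpdated":
            self.value = attrs.get("value")

# Hypothetical fragment; the real page's markup was not preserved here.
page = '<input type="hidden" name="LastUpdated" value="1178658863">'
finder = LastUpdatedFinder()
finder.feed(page)
print(finder.value)  # -> 1178658863
```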
Re: String parsing
Thanks all. Carsten, you are here early and late. Do you ever sleep? ;^) -- http://mail.python.org/mailman/listinfo/python-list
Re: String parsing
On 8 May 2007 19:06:14 -0700, HMS Surprise wrote:
> Thanks for posting. Could you recommend an HTML parser that can be
> used with Python or Jython?

BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) makes
HTML parsing easy as pie, and sufficiently old versions seem to work
with Jython. I just tested this with Jython 2.2a1 and BeautifulSoup
1.x:

Jython 2.2a1 on java1.5.0_07 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("")
>>> print soup.first('input', {'name':'LastUpdated'}).get('value')
1178658863

Hope this helps,

-- 
Carsten Haese
http://informixdb.sourceforge.net
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: String parsing
En Tue, 08 May 2007 23:06:14 -0300, HMS Surprise <[EMAIL PROTECTED]>
escribió:

> Thanks for posting. Could you recommend an HTML parser that can be
> used with Python or Jython?

Try BeautifulSoup, which handles malformed pages pretty well.

-- 
Gabriel Genellina
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: String parsing
On May 8, 9:19 pm, HMS Surprise <[EMAIL PROTECTED]> wrote:
> Yes it could, after I isolate that one string. Making sure that I
> isolate that complete line and only that line is part of the problem.

It comes in as one large string...
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: String parsing
Yes it could, after I isolate that one string. Making sure that I
isolate that complete line and only that line is part of the problem.

Thanks for posting.

jh
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: String parsing
Thanks for posting. Could you recommend an HTML parser that can be
used with Python or Jython?

john
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: String parsing
On 8 May 2007 18:09:52 -0700, HMS Surprise <[EMAIL PROTECTED]> wrote:
> The string below is a piece of a longer string of about 2
> characters returned from a web page.  I need to isolate the number at
> the end of the line containing 'LastUpdated'.  I can find
> 'LastUpdated' with .find but not sure about how to isolate the
> number.  'LastUpdated' is guaranteed to occur only once.  Would
> appreciate it if one of you string parsing whizzes would take a stab
> at it.

Does this help?

In [7]: s = ''

In [8]: int(s.split("=")[-1].split('"')[1])
Out[8]: 1178658863

There's probably a hundred different ways of doing this, but this is
the first that came to mind.

Cheers,
Tim

> Thanks,
>
> jh
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: String parsing
En Tue, 08 May 2007 22:09:52 -0300, HMS Surprise <[EMAIL PROTECTED]>
escribió:

> The string below is a piece of a longer string of about 2
> characters returned from a web page.  I need to isolate the number at
> the end of the line containing 'LastUpdated'.  I can find
> 'LastUpdated' with .find but not sure about how to isolate the
> number.  'LastUpdated' is guaranteed to occur only once.  Would
> appreciate it if one of you string parsing whizzes would take a stab
> at it.

You really should use an html parser here. But assuming that the page
will not change its structure much, you could use a regular expression
like this:

expr = re.compile(r'name\s*=\s*"LastUpdated"\s+value\s*=\s*"(.*?)"',
                  re.IGNORECASE)
number = expr.search(text).group(1)

(Handling of "not found" and "duplicate" cases is left as an exercise
for the reader.)

Note that is as valid as your html, but won't match the expression.

-- 
Gabriel Genellina
-- 
http://mail.python.org/mailman/listinfo/python-list
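Since the sample markup didn't survive in the archive, here is Gabriel's expression exercised on a made-up stand-in line (the tag and its attribute spacing are hypothetical; only the name/value pattern matters):

```python
import re

# Hypothetical stand-in for the stripped HTML sample.
text = '<input type="hidden" name = "LastUpdated" value = "1178658863">'

# \s* around '=' tolerates optional spaces; (.*?) captures the number
# non-greedily up to the next double quote.
expr = re.compile(r'name\s*=\s*"LastUpdated"\s+value\s*=\s*"(.*?)"',
                  re.IGNORECASE)
number = expr.search(text).group(1)
print(number)  # -> 1178658863
```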
String parsing
The string below is a piece of a longer string of about 2
characters returned from a web page. I need to isolate the number at
the end of the line containing 'LastUpdated'. I can find 'LastUpdated'
with .find but am not sure how to isolate the number. 'LastUpdated' is
guaranteed to occur only once. Would appreciate it if one of you
string-parsing whizzes would take a stab at it.

Thanks,

jh
-- 
http://mail.python.org/mailman/listinfo/python-list