Re: [Tutor] Parsing problem

Liam Clarke Mon, 25 Jul 2005 05:38:38 -0700

Hi Paul,

Well, with various tweaks done it parses perfectly, so many thanks. I think I now have a rough understanding of the basics of pyparsing.

Now, on to the fun part of optimising it. At the moment, I'm looking at 2-5 minutes to parse a 2000-line country section, and that's with psyco. Only problem is, I have 157 country sections...

I am running a 650 MHz processor, so that isn't helping either. I read this quote on http://pyparsing.sourceforge.net:

"Thanks again for your help and thanks for writing pyparser! It seems my code needed to be optimized and now I am able to parse a 200mb file in 3 seconds. Now I can stick my tongue out at the Perl guys ;)"

I'm jealous: 200 MB in 3 seconds, and my file's only 4 MB.

Are there any general approaches to optimisation that work well?

My current thinking is to use string methods to split the string into each component section, and then parse each section down to a bare-minimum key and value, i.e. instead of parsing

x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}

out fully, just parse to "x":"{ foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}"

I'm thinking that would avoid the complicated nested structure I have now, and I could parse data out of the string as needed, if needed at all.
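
The splitting idea described above can be sketched with plain string methods and a brace counter. This is only an illustration under stated assumptions: the function name split_top_level is hypothetical, braces are assumed to be balanced, and neither "=" nor braces are assumed to appear inside quoted strings. Whether it is actually faster than the full grammar would need to be measured on the real files.

```python
def split_top_level(text):
    """Return a list of (key, raw_value) pairs for top-level `key = value` items.

    Braced values are returned unparsed, braces included. Assumes balanced
    braces and no '=' or braces inside quoted strings (hypothetical helper).
    """
    pairs = []
    pos, n = 0, len(text)
    while pos < n:
        eq = text.find("=", pos)
        if eq == -1:
            break
        key = text[pos:eq].strip()
        # Skip whitespace between '=' and the value.
        start = eq + 1
        while start < n and text[start].isspace():
            start += 1
        if start < n and text[start] == "{":
            # Braced value: scan forward until the matching close brace.
            depth = 0
            end = start
            while end < n:
                if text[end] == "{":
                    depth += 1
                elif text[end] == "}":
                    depth -= 1
                    if depth == 0:
                        break
                end += 1
            value = text[start:end + 1]
            pos = end + 1
        else:
            # Bare value: runs up to the next whitespace character.
            end = start
            while end < n and not text[end].isspace():
                end += 1
            value = text[start:end]
            pos = end
        pairs.append((key, value))
    return pairs


sample = "x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}"
for key, raw in split_top_level(sample):
    print(key, "->", raw)
```

Each raw value could then be handed to the pyparsing grammar (or to split_top_level again) only when its contents are actually needed.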

Erk, I don't know, I've never had to optimise anything.

Much thanks for creating pyparsing, and doubly thank-you for your assistance in learning how to use it.

Regards,

Liam Clarke

On 7/25/05, Liam Clarke <[EMAIL PROTECTED]> wrote:

Hi Paul,

My apologies. As I was jumping into my car after sending that email, it clicked in my brain:
"Oh yeah... initial & body..."

But good to know about how to accept valid numbers.

Sorry, getting a bit too quick to fire off emails here.

Regards,

Liam Clarke

On 7/25/05, Paul McGuire <[EMAIL PROTECTED]> wrote:
Liam -

The two arguments to Word work this way:
- the first argument lists valid *initial* characters
- the second argument lists valid *body* or subsequent characters

For example, in the identifier definition,

identifier = pp.Word(pp.alphas, pp.alphanums + "_/:.")

identifiers *must* start with an alphabetic character, and then may be
followed by 0 or more alphanumeric, "_", "/", ":" or "." characters.  If only
one argument is supplied, then the same string of characters is used as both
initial and body.  Identifiers are a very typical use of the two-argument
form of Word, as they often start with alphas, but then accept digits and
other punctuation.  No whitespace is permitted within a Word.  The Word
matching will end when a non-body character is seen.
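
A small sketch of the initial/body behaviour described above, assuming pyparsing is installed and imported as pp. It uses the camelCase method names from this thread; recent pyparsing releases also spell parseString as parse_string.

```python
import pyparsing as pp

# First argument: allowed initial characters; second: allowed body characters.
identifier = pp.Word(pp.alphas, pp.alphanums + "_/:.")

# Matching stops at the first non-body character (here, the space).
first = identifier.parseString("foo_bar/baz:1.2 rest")[0]
print(first)  # foo_bar/baz:1.2

# A leading digit is not a valid initial character, so this is rejected.
try:
    identifier.parseString("9lives")
    rejected = False
except pp.ParseException:
    rejected = True
print("9lives rejected:", rejected)

# With a single argument, the same characters serve as initial and body.
digits = pp.Word(pp.nums)
print(digits.parseString("123abc")[0])  # 123
```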

Using this definition:

integer = pp.Word(pp.nums+"-+.", pp.nums)

It will accept "+123", "-345", "678", and ".901".  But in a real number, a
period may occur anywhere in the number, not just as the initial character,
as in "3.14159".  So your bodyCharacters must also include a ".", as in:

integer = pp.Word(pp.nums+"-+.", pp.nums+".")

Let me say, though, that this is a very permissive definition of integer -
for one thing, we really should rename it something like "number", since it
now accepts non-integers as well!  But also, there is no restriction on the
frequency of body characters.  This definition would accept a "number" that
looks like "3.4.3234.111.123.3234".  If you are certain that you will only
receive valid inputs, then this simple definition will be fine.  But if you
will have to handle and reject erroneous inputs, then you might do better
with a number definition like:

point = pp.Literal(".")
number = pp.Combine( pp.Word( "+-"+pp.nums, pp.nums ) +
                     pp.Optional( point + pp.Optional( pp.Word( pp.nums ) ) ) )

This will handle "+123", "-345", "678", and "0.901", but not ".901".  If you
want to accept numbers that begin with "."s, then you'll need to tweak this
a bit further.
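
To compare the definitions discussed above, here is a hedged sketch, again assuming pyparsing imported as pp. The helper accepts() and the variant leading_dot_number are illustrative names, not part of pyparsing, and the last definition is only one possible way to make the tweak for a leading ".".

```python
import pyparsing as pp

point = pp.Literal(".")
digits = pp.Word(pp.nums)

# The permissive single-Word definition.
loose = pp.Word(pp.nums + "-+.", pp.nums + ".")

# The stricter Combine-based definition.
strict = pp.Combine(pp.Word("+-" + pp.nums, pp.nums)
                    + pp.Optional(point + pp.Optional(digits)))

# One possible tweak (an assumption, not the only way): optional sign,
# then either digits with an optional fraction, or "." followed by digits.
sign = pp.Optional(pp.oneOf("+ -"))
leading_dot_number = pp.Combine(
    sign + (digits + pp.Optional(point + pp.Optional(digits)) | point + digits)
)


def accepts(expr, text):
    """Return True if expr matches the whole of text (hypothetical helper)."""
    try:
        (expr + pp.StringEnd()).parseString(text)
        return True
    except pp.ParseException:
        return False


for text in ["+123", "0.901", ".901", "3.4.3234.111.123.3234"]:
    print(text, accepts(loose, text), accepts(strict, text),
          accepts(leading_dot_number, text))
```

Appending StringEnd() forces the whole string to match, which is what exposes the difference: without it, the strict definition would simply match the leading "3.4" of the malformed input and stop.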

One last thing: you may want to start using setName() on some of your
expressions, as in:

number = pp.Combine( pp.Word( "+-"+pp.nums, pp.nums ) +
                     pp.Optional( point + pp.Optional( pp.Word( pp.nums ) ) )
).setName("number")

Note, this is *not* the same as setResultsName.  Here setName is attaching a
name to this pattern, so that when it appears in an exception, the name will
be used instead of an encoded pattern string (such as W:012345...).  There is
no need to do this for Literals; the literal string is used when it appears
in an exception.
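
The difference between the two methods can be illustrated with a short sketch, assuming pyparsing imported as pp. A plain Word is used here rather than the full number expression, since the exact wording of error messages for nested expressions varies between pyparsing versions; newer releases also offer set_name and set_results_name as spellings.

```python
import pyparsing as pp

# setName: a label for the pattern, used when reporting parse errors.
integer = pp.Word(pp.nums).setName("integer")

try:
    integer.parseString("abc")
    message = None
except pp.ParseException as err:
    message = str(err)
print(message)  # the message refers to "integer", not to the raw character set

# setResultsName: a label for retrieving the matched tokens afterwards.
count = pp.Word(pp.nums).setResultsName("count")
result = count.parseString("42")
print(result["count"])
```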

-- Paul

--

'There is only one basic human right, and that is to do as you damn well please.
And with it comes the only basic human duty, to take the consequences.'

