Re: instaparse: composing smaller rules into a bigger one

Mark Engelberg Mon, 18 Nov 2013 16:39:11 -0800

Seems like there are (at least) two issues here.

1.  You have a preference in mind that is not expressed by the grammar.
The parse you got is a valid parse that fits all the rules of the
grammar.  If you want the parser to prefer DRUGPK and EFF interpretations
over other interpretations, you need to specify that with ordered choice
(/), for example:
   TOKEN = DRUGPK / EFF / (NUM | DRUG | PK | MECH | SIGN | ENCLOSED) / WORD


2.  Your rule for space is "<SPACE> = #'\\s+'", i.e., one or more spaces.
But the way your other rules utilize the SPACE rule, this causes a
problem.  For example, you define DRUGPK as ending with SPACE (and that
ending SPACE is part of the DRUGPK token), but your S rule also says that
tokens (including DRUGPK) must be *followed* by a SPACE.  So the DRUGPK
rule will never be satisfied, because it is including the ending whitespace
as part of the token, and then there's no whitespace following the token as
required by the S rule.  As another example, your EFF rule begins "BE?
SPACE SIGN? SPACE MECH" and if the optional BE and SIGN aren't present,
it's looking for two mandatory spaces in a row.

I suggest changing your rule to "<SPACE> = #'\\s*'", i.e., zero or more
spaces.  Or if you don't actually care about seeing the spaces in your
parse output, you can change it to "<SPACE> = <#'\\s*'>".
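
To see the whitespace problem in isolation, here is a minimal sketch. The
grammar is my own cut-down version of the EFF rule (the names strict and
lenient are made up for this example); only the SPACE rule differs between
the two parsers.

```clojure
(require '[instaparse.core :as insta])

;; Hypothetical reduction of the EFF rule from the thread.
;; SPACE is mandatory here: one or more whitespace characters.
(def strict
  (insta/parser
   "EFF = BE? SPACE MECH
    MECH = #'[a-z]+e(s|d)'
    <BE> = 'is' | 'are' | 'was' | 'were'
    <SPACE> = #'\\s+'"))

;; Same rules, but SPACE may be empty and is hidden from the output.
(def lenient
  (insta/parser
   "EFF = BE? SPACE MECH
    MECH = #'[a-z]+e(s|d)'
    <BE> = 'is' | 'are' | 'was' | 'were'
    <SPACE> = <#'\\s*'>"))

;; With BE absent there is no whitespace for the mandatory SPACE to match.
(insta/failure? (strict "increased"))  ; => true
(strict "is increased")                ; => [:EFF "is" " " [:MECH "increased"]]

;; With zero-or-more hidden whitespace, both inputs parse.
(lenient "increased")                  ; => [:EFF [:MECH "increased"]]
(lenient "is increased")               ; => [:EFF "is" [:MECH "increased"]]
```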

If you make both those changes, you'll get:

=> (parsePK "Exposure to didanosine is increased when coadministered with tenofovir disoproxil fumarate [Table 5 and see Clinical Pharmacokinetics (12.3, Tables 9 and 10)].")
[:S
 [:TOKEN [:DRUGPK [:PK "Exposure"] "to" [:DRUG "didanosine"] [:EFF "is" [:MECH "increased"]]]]
 [:TOKEN "when"]
 [:TOKEN [:EFF [:MECH "coadministered"]]]
 [:TOKEN "with"]
 [:TOKEN [:DRUG "tenofovir"]]
 [:TOKEN "disoproxil"]
 [:TOKEN "fumarate"]
 [:TOKEN [:ENCLOSED "[Table 5 and see Clinical Pharmacokinetics (12.3, Tables 9 and 10)]"]]
 [:END "."]]

which I think is what you want.
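
Going back to point 1, the difference between | and / is easiest to see on
a toy grammar. This is only a sketch: the rules S, AB, A and B and the
parser names are invented for the example and have nothing to do with your
grammar. The input "ab" can be read either as one AB or as A followed by B.

```clojure
(require '[instaparse.core :as insta])

;; Unordered choice: both readings of "ab" are equally acceptable.
(def unordered
  (insta/parser
   "S = AB | (A B)
    AB = 'ab'
    A = 'a'
    B = 'b'"))

;; Ordered choice: prefer AB, fall back to A B only if AB does not work.
(def ordered
  (insta/parser
   "S = AB / (A B)
    AB = 'ab'
    A = 'a'
    B = 'b'"))

;; insta/parses returns every parse, so the ambiguity is visible.
(count (insta/parses unordered "ab"))  ; => 2

;; The ordered grammar returns the preferred reading.
(ordered "ab")                         ; => [:S [:AB "ab"]]
```

With plain |, which of the two parses comes back first is not something the
grammar controls, which is why the composite rules need / to win.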

If you have follow-up questions, I recommend posting to the instaparse
google group:
https://groups.google.com/forum/#!forum/instaparse

--Mark

P.S.  I've been experimenting with a feature to make it easier to express
grammars where you find yourself inserting an optional whitespace rule
everywhere.  It is documented here:
https://github.com/Engelberg/instaparse/blob/master/docs/ExperimentalFeatures.md#auto-whitespace
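
As a rough sketch of what that looks like, assuming the :auto-whitespace
:standard option as it appears in the instaparse README (the interface in
the experimental version linked above may differ), and using a made-up
two-rule grammar:

```clojure
(require '[instaparse.core :as insta])

;; Hypothetical grammar with no explicit whitespace rule at all.
;; :auto-whitespace :standard lets optional whitespace appear before
;; each string and regex token, and at the end of the input.
(def words
  (insta/parser
   "sentence = word+
    word = #'[a-z]+'"
   :auto-whitespace :standard))

(words "is increased")  ; => [:sentence [:word "is"] [:word "increased"]]
```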


On Mon, Nov 18, 2013 at 5:47 AM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:

>  Hi all,
>
> I'm having a small problem composing smaller matches in instaparse. Here
> is what I'm trying (the rules to look at are DRUGPK and EFF):
>
> (def parsePK
>   (insta/parser
>    "S  = TOKEN (SPACE TOKEN PUNCT?)* END
>    TOKEN = (NUM | DRUG | PK | DRUGPK | MECH | SIGN | EFF | ENCLOSED) / WORD
>    <WORD> = #'\\w+' | PUNCT
>    <PUNCT> = #'\\p{Punct}'
>    ENCLOSED = PAREN | SQBR
>    <PAREN> = #'\\(.*\\)'
>    <SQBR> =  #'\\[.*\\]'
>     NUM =  #'[0-9]+'
>     ADV =   #'[a-z]+ly'
>    <SPACE> = #'\\s+'
>     DRUG =  #'(?i)didanosine|quinidine|tenofovir'
>     PK =    #'(?i)exposure|bioavailability|lower?[\\s|\\-]?clearance'
>     DRUGPK =  PK SPACE TO SPACE DRUG SPACE EFF? SPACE
>     MECH =  #'[a-z]+e(s|d)'
>     EFF = BE? SPACE SIGN? SPACE MECH | BE? SPACE MECH SPACE ADV?
>     SIGN =  ADV | NEG
>     NEG = 'not'
>     <TO> = 'to' | 'of'
>     <BE> = 'is' | 'are' | 'was' | 'were'
>     END =  '.' " ))
>
> Running the parser returns the following. It seems that the 2 bigger
> composite rules DRUGPK & EFF are not recognised at all. Only the smaller
> pieces are actually shown. I would expect [:TOKEN [:DRUGPK "Exposure to
> didanosine is increased"]] and  [:TOKEN [:EFF "is increased"]] entries.
> (pprint
> (parsePK "Exposure to didanosine is increased when coadministered with
> tenofovir disoproxil fumarate [Table 5 and see Clinical Pharmacokinetics
> (12.3, Tables 9 and 10)]."))
>
>
> [:S
>  [:TOKEN [:PK "Exposure"]]
>  " "
>  [:TOKEN "to"]
>  " "
>  [:TOKEN [:DRUG "didanosine"]]
>  " "
>  [:TOKEN "is"]
>  " "
>  [:TOKEN [:MECH "increased"]]
>  " "
>  [:TOKEN "when"]
>  " "
>  [:TOKEN [:MECH "coadministered"]]
>  " "
>  [:TOKEN "with"]
>  " "
>  [:TOKEN [:DRUG "tenofovir"]]
>  ","
>  " "
>  [:TOKEN "disoproxil"]
>  " "
>  [:TOKEN "fumarate"]
>  [:END "."]]
>
>  Am I thinking about it the wrong way? Can anyone shed some light?
>
> many thanks in advance,
>
> Jim
>
>
>
>
>
>  --
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>

