[Haskell-cafe] Parsec line number off-by-one

2011-09-20 Thread Ben Gamari
Recently I've been playing around with Parsec for a simple parsing
project. While I was able to quickly construct my grammar (simplified
version attached), getting it working has been a bit tricky. In
particular, I am now stuck trying to figure out why Parsec is
mis-reporting line numbers. Parsec seems convinced that line 12 of my
input (also attached) has a "%" character,

  $ runghc Test.hs
  Left "(unknown)" (line 12, column 1):
  unexpected "%"
  expecting space or atom name

while my file clearly disagrees,

  10  %FLAG ATOM_NAME   
  
  11  %FORMAT(20a4) 
  
  12  C1  H1  C2  H2  C3  H3  C4  H4  C5  C6  C7  C8  N1  C9  H9  C10 H10 C11 
H11 C12 
  13  H12 C13 H13 C14 C15 N2  C16 C17 C29 H18 C19 H19 C20 H20 C21 H21 C22 
H221H222H223
  ...
  18  %FLAG CHARGE
  19  %FORMAT(5E16.8)   
  

The task here is to identify the block of data lines (lines 12-17),
ending at the beginning of the next block (starting with "%"). It seems
likely that my problem stems from the fact that I use "try" to
accomplish this but this is as far as I can reason.

Any ideas what might cause this sort of off-by-one? Does anyone see a
better (i.e. working) way to formulate my grammar? Any and all help
would be greatly appreciated. Thanks.

Cheers,

- Ben


module Main(main) where

import Debug.Trace
import Data.Maybe
import Text.ParserCombinators.Parsec
import Text.Parsec.Prim (ParsecT)
import qualified Text.ParserCombinators.Parsec.Token as P
import Text.ParserCombinators.Parsec.Language (emptyDef)

data PrmTopBlock = AtomNames [String]
 deriving (Show, Eq)

countBetween m n p = do xs <- count m p
ys <- count (n - m) $ option Nothing $ do
y <- p
return (Just y)
return (xs ++ catMaybes ys)

restLine = do a <- many (noneOf "\n")
  eol
  return a

eol = do skipMany $ char ' '
 char '\n'

natural = P.integer $ P.makeTokenParser emptyDef

float = do sign <- option 1 (do s <- oneOf "+- "
return $ if s == '-' then -1 else 1)
   x <- P.float $ P.makeTokenParser emptyDef
   return $ sign * x

flagDecl x = do string $ "%FLAG " ++ x
eol

formatDecl = do string "%FORMAT("
count <- many1 digit
format <- letter
length <- many1 digit
char ')'
eol
return (count, format, length)

-- |Multiple lines of lists of a given item
linesOf item = do ls <- many1 $ try (do lookAhead (noneOf "%")
l <- many1 item
eol
return $ trace (show l) l)
  return $ concat ls

atomNameBlock = do flagDecl "ATOM_NAME"
   formatDecl
   atomNames <- linesOf atomName
   return $ AtomNames atomNames
where
atomName = do spaces
  name <- countBetween 1 4 (alphaNum <|> oneOf "\'+-")  "atom name"
  return name

ignoredBlock = do string "%FLAG"  "ignored block flag"
  restLine
  formatDecl
  skipMany (noneOf "%" >> restLine)

hello = do ignoredBlock
   ignoredBlock
   atomNameBlock

parsePrmTopFile input = parse hello "(unknown)" input

test = do a <- readFile "test.prmtop"
  print $ parsePrmTopFile a

main = test


test.prmtop
Description: Binary data
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parsec line number off-by-one

2011-09-20 Thread Roman Cheplyaka
Hi Ben,

This is indeed a bug in parsec.

I have written a patch that fixes this. Currently Antoine Latter (current
parsec's maintainer) and I are working on getting these patches into the
next parsec release.

As a workaround until then, you can apply the attached patch manually.

  darcs get http://code.haskell.org/parsec3
  cd parsec3
  darcs apply parsec.dpatch
  cabal install

With this patch, the error message is:

  Left "(unknown)" (line 18, column 1):
  expecting space or atom name


* Ben Gamari  [2011-09-20 23:32:34-0400]
> Recently I've been playing around with Parsec for a simple parsing
> project. While I was able to quickly construct my grammar (simplified
> version attached), getting it working has been a bit tricky. In
> particular, I am now stuck trying to figure out why Parsec is
> mis-reporting line numbers. Parsec seems convinced that line 12 of my
> input (also attached) has a "%" character,
> 
>   $ runghc Test.hs
>   Left "(unknown)" (line 12, column 1):
>   unexpected "%"
>   expecting space or atom name
> 
> while my file clearly disagrees,
> 
>   10  %FLAG ATOM_NAME 
> 
>   11  %FORMAT(20a4)   
> 
>   12  C1  H1  C2  H2  C3  H3  C4  H4  C5  C6  C7  C8  N1  C9  H9  C10 H10 C11 
> H11 C12 
>   13  H12 C13 H13 C14 C15 N2  C16 C17 C29 H18 C19 H19 C20 H20 C21 H21 C22 
> H221H222H223
>   ...
>   18  %FLAG CHARGE
>   19  %FORMAT(5E16.8) 
> 
> 
> The task here is to identify the block of data lines (lines 12-17),
> ending at the beginning of the next block (starting with "%"). It seems
> likely that my problem stems from the fact that I use "try" to
> accomplish this but this is as far as I can reason.
> 
> Any ideas what might cause this sort of off-by-one? Does anyone see a
> better (i.e. working) way to formulate my grammar? Any and all help
> would be greatly appreciated. Thanks.
> 
> Cheers,
> 
> - Ben
> 
> 



> ___
> Haskell-Cafe mailing list
> Haskell-Cafe@haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe


-- 
Roman I. Cheplyaka :: http://ro-che.info/
4 patches for repository http://code.haskell.org/parsec3:

Sun Feb 20 18:24:22 EET 2011  Roman Cheplyaka 
  * Choose the longest match when merging error messages

Sun Feb 20 18:24:49 EET 2011  Roman Cheplyaka 
  * try: do not reset the error position

Sun Feb 20 18:29:20 EET 2011  Roman Cheplyaka 
  * lookAhead: do not consume input on success; update documentation

Sun Feb 20 19:30:26 EET 2011  Roman Cheplyaka 
  * Improve 

New patches:

[Choose the longest match when merging error messages
Roman Cheplyaka **20110220162422
 Ignore-this: 54e2733159a1574abb229e09ff6935c1
] hunk ./Text/Parsec/Error.hs 137
 = ParseError pos (msg : filter (msg /=) msgs)
 
 mergeError :: ParseError -> ParseError -> ParseError
-mergeError (ParseError pos msgs1) (ParseError _ msgs2)
-= ParseError pos (msgs1 ++ msgs2)
+mergeError (ParseError pos1 msgs1) (ParseError pos2 msgs2)
+= case pos1 `compare` pos2 of
+-- select the longest match
+EQ -> ParseError pos1 (msgs1 ++ msgs2)
+GT -> ParseError pos1 msgs1
+LT -> ParseError pos2 msgs2
 
 instance Show ParseError where
 show err
[try: do not reset the error position
Roman Cheplyaka **20110220162449
 Ignore-this: 8508bc41fc6dcd9b7c06aac762f12c71
] hunk ./Text/Parsec/Prim.hs 435
 
 try :: ParsecT s u m a -> ParsecT s u m a
 try p =
-ParsecT $ \s@(State _ pos _) cok _ eok eerr ->
-let pcerr parseError = eerr $ setErrorPos pos parseError 
-in unParser p s cok pcerr eok eerr
+ParsecT $ \s cok _ eok eerr ->
+unParser p s cok eerr eok eerr
 
 -- | The parser @tokenPrim showTok posFromTok testTok@ accepts a token @t@
 -- with result @x@ when the function @testTok t@ returns @'Just' x@. The
[lookAhead: do not consume input on success; update documentation
Roman Cheplyaka **20110220162920
 Ignore-this: e884771490209b93e9fec044543a18ef
] {
hunk ./Text/Parsec/Combinator.hs 279
 <|>
   do{ x <- p; xs <- scan; return (x:xs) }
 
--- | @lookAhead p@ parses @p@ without consuming any input.
-
-lookAhead :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m a
-lookAhead p = do{ state <- getParserState
-; x <- p
-; setParserState state
-; return x
-}
-
hunk ./Text/Parsec/Prim.hs 40
 , (<|>)
 , label
 , labels
+, lookAhead
 , Stream(..)
 , tokens
 , try
hunk ./Text/Parsec/Prim.hs 439
 ParsecT $ \s cok _ eok eerr ->
 unParser p s cok eerr eok eerr
 
+-- | @lookAhead p@ parses @p@ without consuming any input.
+--
+-- If @p@ fails and consumes some input, so does @lookAhead@. Combine with 
'try'
+-- if this is undesirable.
+

Re: [Haskell-cafe] Parsec line number off-by-one

2011-09-21 Thread Christian Maeder

Hi,

1. your "lookAhead" is unnecessary, because your items (atomNames) never 
start with "%".


2. your "try" fails in (line 12, column 1), because the last item (aka 
atomName) starts consuming "\n", before your eol parser is called.


So rather than calling spaces before every real atom, I would call it 
after every real atom and after your formatDecl (so before your linesOf 
parser).


atomNameBlock = do flagDecl "ATOM_NAME"
   formatDecl
   spaces
   atomNames <- many1 atomName
   return $ AtomNames atomNames
where
atomName = do
  name <- countBetween 1 4 (alphaNum <|> 
oneOf "\'+-")  "atom name"

  spaces
  return name

Since spaces also consume "\n", linesOf can just be "many1"!

HTH Christian


Am 21.09.2011 05:32, schrieb Ben Gamari:

Recently I've been playing around with Parsec for a simple parsing
project. While I was able to quickly construct my grammar (simplified
version attached), getting it working has been a bit tricky. In
particular, I am now stuck trying to figure out why Parsec is
mis-reporting line numbers. Parsec seems convinced that line 12 of my
input (also attached) has a "%" character,

   $ runghc Test.hs
   Left "(unknown)" (line 12, column 1):
   unexpected "%"
   expecting space or atom name

while my file clearly disagrees,

   10  %FLAG ATOM_NAME
   11  %FORMAT(20a4)
   12  C1  H1  C2  H2  C3  H3  C4  H4  C5  C6  C7  C8  N1  C9  H9  C10 H10 C11 
H11 C12
   13  H12 C13 H13 C14 C15 N2  C16 C17 C29 H18 C19 H19 C20 H20 C21 H21 C22 
H221H222H223
   ...
   18  %FLAG CHARGE
   19  %FORMAT(5E16.8)

The task here is to identify the block of data lines (lines 12-17),
ending at the beginning of the next block (starting with "%"). It seems
likely that my problem stems from the fact that I use "try" to
accomplish this but this is as far as I can reason.

Any ideas what might cause this sort of off-by-one? Does anyone see a
better (i.e. working) way to formulate my grammar? Any and all help
would be greatly appreciated. Thanks.

Cheers,

- Ben





___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parsec line number off-by-one

2011-09-21 Thread Ben Gamari
On Wed, 21 Sep 2011 11:27:31 +0200, Christian Maeder  
wrote:
> Hi,
> 
> 1. your "lookAhead" is unnecessary, because your items (atomNames) never 
> start with "%".
> 
I see.

> 2. your "try" fails in (line 12, column 1), because the last item (aka 
> atomName) starts consuming "\n", before your eol parser is called.
> 
Ahh, this is a good point. I for some reason seeded the thought in my
mind that spaces takes the ' ' character, not '\n'.

> So rather than calling spaces before every real atom, I would call it 
> after every real atom and after your formatDecl (so before your linesOf 
> parser).
> 
Excellent solution. I appreciate your help. That would have taken me
quite a bit of head-banging to find.

Cheers,

- Ben

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Parsec line number off-by-one

2011-09-21 Thread Ben Gamari
On Wed, 21 Sep 2011 09:37:40 +0300, Roman Cheplyaka  wrote:
> Hi Ben,
> 
> This is indeed a bug in parsec.
> 
Ahh good. I'm glad I'm not crazy. Given that it seems the lookahead is
actually unnecessary, looks like I can skip the patch too. Thanks for
your reply!

Cheers,

- Ben

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe