Re: [Haskell-cafe] simple parsec question

2013-03-05 Thread S. Doaitse Swierstra
Maybe this is something you do not even want to use a parser combinator library 
for. The package

http://hackage.haskell.org/packages/archive/list-grouping/0.1.1/doc/html/Data-List-Grouping.html

contains a function breakBefore, so you can write

main = do inp <- readFile ...
 let result = map mkSection . breakBefore ((= ':').last)). 
lines $ inp

mkSection (l:ll) = Section (Top l) (Contents ll)

Doaitse


On Mar 3, 2013, at 16:44 , Immanuel Normann  
wrote:

> Hi,
> 
> I am trying to parse a semi structured text with parsec that basically should 
> identify sections. Each section starts with a headline and has an 
> unstructured content - that's all. For instance, consider the following 
> example text (inside the dashed lines):
> 
> ---
> 
> top 1:
> 
> some text ... bla
> 
> top 2:
> 
> more text ... bla bla
> 
> 
> ---
> 
> This should be parsed into a structure like this:
> 
> [Section (Top 1) (Content "some text ... bla"), Section (Top 1) (Content 
> "more text ... bla")]
> 
> Say, I have a parser "headline", but the content after a headline could be 
> anything that is different from what "headline" parses.
> How could the "section" parser making use of "headline" look like?
> My idea would be to use the "manyTill" combinator, but I don"t find an easy 
> solution.
> 
> Many thanks for any hint
> 
> Immanuel
> ___
> Haskell-Cafe mailing list
> Haskell-Cafe@haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] simple parsec question

2013-03-05 Thread Immanuel Normann
Carlo,

Thanks a lot! This looks very promising (though I have to test it for my
purpose more in depth). As you mention, the key seems to be the optionMaybe
combinator. Thanks for pointing to it.

Immanuel


2013/3/5 Carlo Hamalainen 

> On Mon, Mar 4, 2013 at 1:44 AM, Immanuel Normann <
> immanuel.norm...@googlemail.com> wrote:
>
>> I am trying to parse a semi structured text with parsec that basically
>> should identify sections. Each section starts with a headline and has an
>> unstructured content - that's all.
>>
>
> Here's my attempt: https://gist.github.com/carlohamalainen/5087207
>
> {-# LANGUAGE FlexibleContexts #-}
>
> import Text.Parsec
> import Control.Applicative hiding ((<|>),many)
>
> -- Example input:
>
> {-
> top 1:
>
> some text ... bla
>
> top 2:
>
> more text ... bla bla
>
> -}
>
> data Top = Top String deriving (Show)
> data Content = Content [String] deriving (Show)
> data Section = Section Top Content deriving (Show)
>
> headline = do
> t <- many1 (noneOf ":\n")
> char ':'
> newline
>
> return $ Top t
>
> contentLine = do
> x <- many (noneOf ":\n")
> newline
> return x
>
> content = do
> line <- optionMaybe (try contentLine)
>
> case line of Just x -> do xs <- content
>   return (x:xs)
>  _  -> return []
>
> section = do
> h <- headline
> c <- Content <$> content
> return $ Section h c
>
> main = do
> x <- readFile "simple.txt"
> print $ parse (many section) "" x
>
>
> Example run using your sample data:
>
> $ runhaskell Simple.hs
> Right [Section (Top "top 1") (Content ["","some text ... bla",""]),Section
> (Top "top 2") (Content ["","more text ... bla bla",""])]
>
> Notes:
>
> * I had to assume that a content line does not contain a ':', because that
> is the only way to distinguish a head-line (correct me if I'm wrong).
>
> * The key was to use optionMaybe along with try; see the definition of
> content.
>
> * I haven't tested this code on very large inputs.
>
> * I slightly changed the definition of Content to have a list of Strings,
> one for each line. I'm sure this could be altered if you wanted to retain
> all whitespace.
>
> * I am still new to Parsec, so don't take this as the definitive answer ;-)
>
> --
> Carlo Hamalainen
> http://carlo-hamalainen.net
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] simple parsec question

2013-03-04 Thread Carlo Hamalainen
On Mon, Mar 4, 2013 at 1:44 AM, Immanuel Normann <
immanuel.norm...@googlemail.com> wrote:

> I am trying to parse a semi structured text with parsec that basically
> should identify sections. Each section starts with a headline and has an
> unstructured content - that's all.
>

Here's my attempt: https://gist.github.com/carlohamalainen/5087207

{-# LANGUAGE FlexibleContexts #-}

import Text.Parsec
import Control.Applicative hiding ((<|>),many)

-- Example input:

{-
top 1:

some text ... bla

top 2:

more text ... bla bla

-}

data Top = Top String deriving (Show)
data Content = Content [String] deriving (Show)
data Section = Section Top Content deriving (Show)

headline = do
t <- many1 (noneOf ":\n")
char ':'
newline

return $ Top t

contentLine = do
x <- many (noneOf ":\n")
newline
return x

content = do
line <- optionMaybe (try contentLine)

case line of Just x -> do xs <- content
  return (x:xs)
 _  -> return []

section = do
h <- headline
c <- Content <$> content
return $ Section h c

main = do
x <- readFile "simple.txt"
print $ parse (many section) "" x


Example run using your sample data:

$ runhaskell Simple.hs
Right [Section (Top "top 1") (Content ["","some text ... bla",""]),Section
(Top "top 2") (Content ["","more text ... bla bla",""])]

Notes:

* I had to assume that a content line does not contain a ':', because that
is the only way to distinguish a head-line (correct me if I'm wrong).

* The key was to use optionMaybe along with try; see the definition of
content.

* I haven't tested this code on very large inputs.

* I slightly changed the definition of Content to have a list of Strings,
one for each line. I'm sure this could be altered if you wanted to retain
all whitespace.

* I am still new to Parsec, so don't take this as the definitive answer ;-)

-- 
Carlo Hamalainen
http://carlo-hamalainen.net
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] simple parsec question

2013-03-04 Thread Immanuel Normann
Andrey,
Thanks a lot for your effort! I have the same suspect that the lookahead in
the content parser is the problem, but I don't know how to solve it either.
At least the I learned from your code that noneOf is also a quite useful
parser in this context which I have overlooked.
Anyway, if you find a solution it would be great! In the end the task
itself doesn't look very specific, but rather general: an alternation
between strictly (the headline in my case) and loosely (the content in my
case) structured text. It shouldn't be difficult to build a parser for such
a setting.

(btw. I am aware the my test parser would (or rather should) parse only the
first section. For testing this would be sufficient.)



2013/3/4 Andrey Chudnov 

>  Immanuel,
> I tried but I couldn't figure it out. Here's a gist with my attempts and
> results so far: https://gist.github.com/achudnov/f3af65f11d5162c73064There, 
> 'test' uses my attempt at specifying the parser, 'test2' uses yours.
> Note that your attempt wouldn't parse multiple sections -- for that you
> need to use 'many section' instead of just 'section' in 'parse'
> ('parseFromFile' in the original).
> I think what's going on is the lookahead is wrong, but I'm not sure how
> exactly. I'll give it another go tomorrow if I have time.
>
> /Andrey
>
>
> On 03/03/2013 05:16 PM, Immanuel Normann wrote:
>
>Andrey,
>
>  Thanks for your attempt, but it doesn't seem to work. The easy part is
> the headline, but the content makes trouble.
>
> Let me write the code a bit more explicit, so you can copy and paste it:
>
> --
> {-# LANGUAGE FlexibleContexts #-}
>
> module Main where
>
> import Text.Parsec
>
> data Top = Top String deriving (Show)
> data Content = Content String deriving (Show)
> data Section = Section Top Content deriving (Show)
>
> headline :: Stream s m Char => ParsecT s u m Top
> headline = manyTill anyChar (char ':' >> newline) >>= return . Top
>
> content :: Stream s m Char => ParsecT s u m Content
> content = manyTill anyChar (try headline) >>= return . Content
>
> section :: Stream s m Char => ParsecT s u m Section
> section = do {h <- headline; c <- content; return (Section h c)}
> --
>
>
>  Assume the following example text is stored in  "/tmp/test.txt":
> ---
> top 1:
>
> some text ... bla
>
> top 2:
>
> more text ... bla bla
> ---
>
>  Now I run the section parser in ghci against the above mentioned example
> text stored in "/tmp/test.txt":
>
> *Main> parseFromFile section "/tmp/test.txt"
> Right (Section (Top "top 1") (Content ""))
>
>  I don't understand the behaviour of the content parser here. Why does it
> return ""? Or perhaps more generally, I don't understand the manyTill
> combinator (though I read the docs).
>
>  Side remark: of cause for this little task it is probably to much effort
> to use parsec. However, my content in fact has an internal structure which
> I would like to parse further, but I deliberately abstracted from these
> internals as they don't effect my above stated problem.
>
>  Immanuel
>
>
> 2013/3/3 Andrey Chudnov 
>
>> Immanuel,
>> Since a heading always starts with a new line (and ends with a colon
>> followed by a carriage return or just a colon?), I think it might be useful
>> to first separate the input into lines and then classify them depending on
>> whether it's a heading or not and reassemble them into the value you need.
>> You don't even need parsec for that.
>>
>> However, if you really want to use parsec, you can write something like
>> (warning, not tested):
>> many $ liftM2 Section headline content
>>where headline = anyChar `manyTill` (char ':' >> spaces >> newline)
>>content  = anyChar `manyTill` (try $ newline >> headline)
>>
>> /Andrey
>>
>>
>> On 3/3/2013 10:44 AM, Immanuel Normann wrote:
>>
>>> I am trying to parse a semi structured text with parsec that basically
>>> should identify sections. Each section starts with a headline and has an
>>> unstructured content - that's all. For instance, consider the following
>>> example text (inside the dashed lines):
>>>
>>> ---
>>>
>>> top 1:
>>>
>>> some text ... bla
>>>
>>> top 2:
>>>
>>> more text ... bla bla
>>>
>>>
>>> ---
>>>
>>> This should be parsed into a structure like this:
>>>
>>> [Section (Top 1) (Content "some text ... bla"), Section (Top 1) (Content
>>> "more text ... bla")]
>>>
>>> Say, I have a parser "headline", but the content after a headline could
>>> be anything that is different from what "headline" parses.
>>> How could the "section" parser making use of "headline" look like?
>>> My idea would be to use the "manyTill" combinator, but I don"t find an
>>> easy solution.
>>>
>>
>
>
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] simple parsec question

2013-03-03 Thread Andrey Chudnov
Immanuel,
I tried but I couldn't figure it out. Here's a gist with my attempts and
results so far: https://gist.github.com/achudnov/f3af65f11d5162c73064
There, 'test' uses my attempt at specifying the parser, 'test2' uses
yours. Note that your attempt wouldn't parse multiple sections -- for
that you need to use 'many section' instead of just 'section' in 'parse'
('parseFromFile' in the original).
I think what's going on is the lookahead is wrong, but I'm not sure how
exactly. I'll give it another go tomorrow if I have time.

/Andrey

On 03/03/2013 05:16 PM, Immanuel Normann wrote:
> Andrey,
>
> Thanks for your attempt, but it doesn't seem to work. The easy part is
> the headline, but the content makes trouble.
>
> Let me write the code a bit more explicit, so you can copy and paste it:
>
> --
> {-# LANGUAGE FlexibleContexts #-}
>
> module Main where
>
> import Text.Parsec
>
> data Top = Top String deriving (Show)
> data Content = Content String deriving (Show)
> data Section = Section Top Content deriving (Show)
>
> headline :: Stream s m Char => ParsecT s u m Top
> headline = manyTill anyChar (char ':' >> newline) >>= return . Top
>
> content :: Stream s m Char => ParsecT s u m Content
> content = manyTill anyChar (try headline) >>= return . Content
>
> section :: Stream s m Char => ParsecT s u m Section
> section = do {h <- headline; c <- content; return (Section h c)}
> --
>
>
> Assume the following example text is stored in  "/tmp/test.txt":
> ---
> top 1:
>
> some text ... bla
>
> top 2:
>
> more text ... bla bla
> ---
>
> Now I run the section parser in ghci against the above mentioned
> example text stored in "/tmp/test.txt":
>
> *Main> parseFromFile section "/tmp/test.txt"
> Right (Section (Top "top 1") (Content ""))
>
> I don't understand the behaviour of the content parser here. Why does
> it return ""? Or perhaps more generally, I don't understand the
> manyTill combinator (though I read the docs).
>
> Side remark: of cause for this little task it is probably to much
> effort to use parsec. However, my content in fact has an internal
> structure which I would like to parse further, but I deliberately
> abstracted from these internals as they don't effect my above stated
> problem.
>
> Immanuel
>
>
> 2013/3/3 Andrey Chudnov mailto:achud...@gmail.com>>
>
> Immanuel,
> Since a heading always starts with a new line (and ends with a
> colon followed by a carriage return or just a colon?), I think it
> might be useful to first separate the input into lines and then
> classify them depending on whether it's a heading or not and
> reassemble them into the value you need. You don't even need
> parsec for that.
>
> However, if you really want to use parsec, you can write something
> like (warning, not tested):
> many $ liftM2 Section headline content
>where headline = anyChar `manyTill` (char ':' >> spaces >> newline)
>content  = anyChar `manyTill` (try $ newline >>
> headline)
>
> /Andrey
>
>
> On 3/3/2013 10:44 AM, Immanuel Normann wrote:
>
> I am trying to parse a semi structured text with parsec that
> basically should identify sections. Each section starts with a
> headline and has an unstructured content - that's all. For
> instance, consider the following example text (inside the
> dashed lines):
>
> ---
>
> top 1:
>
> some text ... bla
>
> top 2:
>
> more text ... bla bla
>
>
> ---
>
> This should be parsed into a structure like this:
>
> [Section (Top 1) (Content "some text ... bla"), Section (Top
> 1) (Content "more text ... bla")]
>
> Say, I have a parser "headline", but the content after a
> headline could be anything that is different from what
> "headline" parses.
> How could the "section" parser making use of "headline" look like?
> My idea would be to use the "manyTill" combinator, but I don"t
> find an easy solution.
>
>

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] simple parsec question

2013-03-03 Thread Immanuel Normann
Andrey,

Thanks for your attempt, but it doesn't seem to work. The easy part is the
headline, but the content makes trouble.

Let me write the code a bit more explicit, so you can copy and paste it:

--
{-# LANGUAGE FlexibleContexts #-}

module Main where

import Text.Parsec

data Top = Top String deriving (Show)
data Content = Content String deriving (Show)
data Section = Section Top Content deriving (Show)

headline :: Stream s m Char => ParsecT s u m Top
headline = manyTill anyChar (char ':' >> newline) >>= return . Top

content :: Stream s m Char => ParsecT s u m Content
content = manyTill anyChar (try headline) >>= return . Content

section :: Stream s m Char => ParsecT s u m Section
section = do {h <- headline; c <- content; return (Section h c)}
--


Assume the following example text is stored in  "/tmp/test.txt":
---
top 1:

some text ... bla

top 2:

more text ... bla bla
---

Now I run the section parser in ghci against the above mentioned example
text stored in "/tmp/test.txt":

*Main> parseFromFile section "/tmp/test.txt"
Right (Section (Top "top 1") (Content ""))

I don't understand the behaviour of the content parser here. Why does it
return ""? Or perhaps more generally, I don't understand the manyTill
combinator (though I read the docs).

Side remark: of cause for this little task it is probably to much effort to
use parsec. However, my content in fact has an internal structure which I
would like to parse further, but I deliberately abstracted from these
internals as they don't effect my above stated problem.

Immanuel


2013/3/3 Andrey Chudnov 

> Immanuel,
> Since a heading always starts with a new line (and ends with a colon
> followed by a carriage return or just a colon?), I think it might be useful
> to first separate the input into lines and then classify them depending on
> whether it's a heading or not and reassemble them into the value you need.
> You don't even need parsec for that.
>
> However, if you really want to use parsec, you can write something like
> (warning, not tested):
> many $ liftM2 Section headline content
>where headline = anyChar `manyTill` (char ':' >> spaces >> newline)
>content  = anyChar `manyTill` (try $ newline >> headline)
>
> /Andrey
>
>
> On 3/3/2013 10:44 AM, Immanuel Normann wrote:
>
>> I am trying to parse a semi structured text with parsec that basically
>> should identify sections. Each section starts with a headline and has an
>> unstructured content - that's all. For instance, consider the following
>> example text (inside the dashed lines):
>>
>> ---
>>
>> top 1:
>>
>> some text ... bla
>>
>> top 2:
>>
>> more text ... bla bla
>>
>>
>> ---
>>
>> This should be parsed into a structure like this:
>>
>> [Section (Top 1) (Content "some text ... bla"), Section (Top 1) (Content
>> "more text ... bla")]
>>
>> Say, I have a parser "headline", but the content after a headline could
>> be anything that is different from what "headline" parses.
>> How could the "section" parser making use of "headline" look like?
>> My idea would be to use the "manyTill" combinator, but I don"t find an
>> easy solution.
>>
>
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] simple parsec question

2013-03-03 Thread Andrey Chudnov

Immanuel,
Since a heading always starts with a new line (and ends with a colon 
followed by a carriage return or just a colon?), I think it might be 
useful to first separate the input into lines and then classify them 
depending on whether it's a heading or not and reassemble them into the 
value you need. You don't even need parsec for that.


However, if you really want to use parsec, you can write something like 
(warning, not tested):

many $ liftM2 Section headline content
   where headline = anyChar `manyTill` (char ':' >> spaces >> newline)
   content  = anyChar `manyTill` (try $ newline >> headline)

/Andrey

On 3/3/2013 10:44 AM, Immanuel Normann wrote:
I am trying to parse a semi structured text with parsec that basically 
should identify sections. Each section starts with a headline and has 
an unstructured content - that's all. For instance, consider the 
following example text (inside the dashed lines):


---

top 1:

some text ... bla

top 2:

more text ... bla bla


---

This should be parsed into a structure like this:

[Section (Top 1) (Content "some text ... bla"), Section (Top 1) 
(Content "more text ... bla")]


Say, I have a parser "headline", but the content after a headline 
could be anything that is different from what "headline" parses.

How could the "section" parser making use of "headline" look like?
My idea would be to use the "manyTill" combinator, but I don"t find an 
easy solution.


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] simple parsec question

2013-03-03 Thread Immanuel Normann
Hi,

I am trying to parse a semi structured text with parsec that basically
should identify sections. Each section starts with a headline and has an
unstructured content - that's all. For instance, consider the following
example text (inside the dashed lines):

---

top 1:

some text ... bla

top 2:

more text ... bla bla


---

This should be parsed into a structure like this:

[Section (Top 1) (Content "some text ... bla"), Section (Top 1) (Content
"more text ... bla")]

Say, I have a parser "headline", but the content after a headline could be
anything that is different from what "headline" parses.
How could the "section" parser making use of "headline" look like?
My idea would be to use the "manyTill" combinator, but I don"t find an easy
solution.

Many thanks for any hint

Immanuel
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe