Re: Syntax error if paragraph contains more than 1 printable character

2023-12-14 Thread Steve Litt
James K. Lowden said on Wed, 13 Dec 2023 12:42:20 -0500
>Rather than remove trailing blanks from the input, I would remove them
>in flex.  The problem can be solved with regular expressions but,
>since we're only matching one value, it's easily done in an action: 
>
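>   .+  {
>           // walk back from the last character, erasing trailing spaces
>           // ('auto' requires compiling the scanner as C++)
>           for( auto p = yytext + yyleng - 1; p >= yytext; p-- ) {
>               if( *p != 0x20 ) break;
>               *p = '\0';
>           }
>           return LINE;
>       }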

Nice!

>
>
>To solve it with regex, 
>
>   ([[:blank:]]*[[:^space:]])+ { ... return LINE; }
>   [[:blank:]]+$   // ignore

Nice!

Thanks James. I'll be investigating your techniques in the near future.
As you can see, your suggestions contributed to my working text-to-HTML
baby Hello World.

SteveT

Steve Litt 

Autumn 2023 featured book: Rapid Learning for the 21st Century
http://www.troubleshooters.com/rl21



Re: Syntax error if paragraph contains more than 1 printable character

2023-12-14 Thread James K. Lowden
On Wed, 13 Dec 2023 19:01:22 -0500
Steve Litt wrote:

> >.+/\n                 { ... return LINE; }
> >(\n[[:blank:]]*){2,}  { return SEP; } // two or more newlines, i.e. one or more blank lines
> >\n                    { /* ignore */ }
> 
> Thanks James, this looks great!

You're welcome.  It occurs to me that

.+/\n

is the same as

.+

so, simpler still.  :-) 


> I won't need to consider end-of-line spaces because I now have a sed
> one-liner preprocessor that gets rid of trailing space :-).

Flex is a regex engine, and can do anything sed can do.  Your system is
simpler if it can deal with all acceptable input, without
preprocessing.  

Rather than remove trailing blanks from the input, I would remove them
in flex.  The problem can be solved with regular expressions but,
since we're only matching one value, it's easily done in an action: 

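.+  {
        // walk back from the last character, erasing trailing spaces
        // ('auto' requires compiling the scanner as C++)
        for( auto p = yytext + yyleng - 1; p >= yytext; p-- ) {
            if( *p != 0x20 ) break;
            *p = '\0';
        }
        return LINE;
    }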
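
For anyone who wants to try the trimming logic outside a scanner, here is a minimal sketch in plain C. The name rtrim_blanks is hypothetical and not part of flex. Unlike the action above, which tests only for 0x20, this version also treats a tab as a blank; that is an assumption about what "trailing blanks" should cover. It works on a length rather than a pointer, so it never steps to an address before the start of the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: overwrite trailing spaces and tabs with NUL bytes,
 * in place, and return the new length of the text. */
size_t rtrim_blanks(char *text, size_t len)
{
    assert(text != NULL);
    while (len > 0 && (text[len - 1] == ' ' || text[len - 1] == '\t')) {
        text[--len] = '\0';
    }
    return len;
}
```

Inside a flex action the call would presumably look like rtrim_blanks(yytext, yyleng), with the result used as the token's length.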


To solve it with regex, 

([[:blank:]]*[[:^space:]])+ { ... return LINE; }
[[:blank:]]+$   // ignore
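
The first pattern can be tried without flex by using the POSIX regex API, as in the rough sketch below. Two things are assumptions here: the function name line_match_length is made up for illustration, and because POSIX extended regular expressions do not have flex's [[:^space:]] form, the equivalent bracket expression [^[:space:]] is substituted. The leading ^ pins the match to the start of the string, which mimics flex matching at the current input position.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return how many leading characters of `line` the
 * LINE pattern would match, or 0 if the line holds no printable text. */
size_t line_match_length(const char *line)
{
    regex_t re;
    regmatch_t m;
    size_t len = 0;

    assert(line != NULL);
    if (regcomp(&re, "^([[:blank:]]*[^[:space:]])+", REG_EXTENDED) != 0)
        return 0;                 /* pattern failed to compile */
    if (regexec(&re, line, 1, &m, 0) == 0)
        len = (size_t)m.rm_eo;    /* end offset of the overall match */
    regfree(&re);
    return len;
}
```

The match always ends on a non-space character, so any trailing blanks are left over for the second rule to discard.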

--jkl





Re: Syntax error if paragraph contains more than 1 printable character

2023-12-13 Thread Steve Litt
James K. Lowden said on Tue, 12 Dec 2023 20:24:35 -0500

>On Tue, 12 Dec 2023 23:06:14 -0500
>Steve Litt  wrote:
>
>> I've already split paratext into multiple LINE tokens which represent
>> a line without its NL, and now I'm thinking of splitting line into
>> multiple chars ("[^\n]"). Perhaps this will make the rules less
>> complicated, though longer.  
>
>Have the scanner return two tokens only: 
>
>   LINE  a line of text, no newline
>   SEP   a blank line
>
>The lexer might have:
>
>.+/\n                 { ... return LINE; }
>(\n[[:blank:]]*){2,}  { return SEP; } // two or more newlines, i.e. one or more blank lines
>\n                    { /* ignore */ }

Thanks James, this looks great!

I won't need to consider end-of-line spaces because I now have a sed
one-liner preprocessor that gets rid of trailing space :-).

Right now I've gone back to the Hello World stage and am making a
Flex/Bison scanner that does nothing but copy the file. Once I learn
from that, I'll try your suggestions. They look refreshingly simple and
understandable to me.

Thanks much,

SteveT

Steve Litt 




Re: Syntax error if paragraph contains more than 1 printable character

2023-12-13 Thread James K. Lowden
On Tue, 12 Dec 2023 23:06:14 -0500
Steve Litt wrote:

> I've already split paratext into multiple LINE tokens which represent
> a line without its NL, and now I'm thinking of splitting line into
> multiple chars ("[^\n]"). Perhaps this will make the rules less
> complicated, though longer.

Have the scanner return two tokens only: 

   LINE  a line of text, no newline
   SEP   a blank line

The lexer might have:

.+/\n                 { ... return LINE; }
(\n[[:blank:]]*){2,}  { return SEP; } // two or more newlines, i.e. one or more blank lines
\n                    { /* ignore */ }
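
To see which token stream these three rules produce, the hand-written C sketch below imitates them on an in-memory buffer. It is not generated by flex, and every name in it (next_token, TOK_LINE and so on) is hypothetical. One simplification is assumed: the LINE rule is treated as plain .+, so a last line with no trailing newline still counts as a LINE.

```c
#include <assert.h>
#include <stddef.h>

enum token { TOK_EOF = 0, TOK_LINE, TOK_SEP };

/* Scan one token from the NUL-terminated buffer `buf`, starting at *pos.
 * For TOK_LINE, *start and *len describe the text of the line. */
enum token next_token(const char *buf, size_t *pos, size_t *start, size_t *len)
{
    assert(buf != NULL && pos != NULL && start != NULL && len != NULL);
    for (;;) {
        size_t p = *pos;
        if (buf[p] == '\0')
            return TOK_EOF;
        if (buf[p] != '\n') {                 /* rule: .+ */
            while (buf[p] != '\0' && buf[p] != '\n')
                p++;
            *start = *pos;
            *len = p - *pos;
            *pos = p;
            return TOK_LINE;
        }
        /* rule: (\n[[:blank:]]*){2,}  count newline-plus-blanks groups */
        size_t reps = 0;
        while (buf[p] == '\n') {
            p++;
            while (buf[p] == ' ' || buf[p] == '\t')
                p++;
            reps++;
        }
        if (reps >= 2) {
            *pos = p;
            return TOK_SEP;
        }
        *pos += 1;                            /* rule: \n, silently ignored */
    }
}
```

Fed the text "a\nb\n\nc\n", the function yields LINE, LINE, SEP, LINE and then end of input, so a two-line paragraph arrives at the parser as two consecutive LINE tokens.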

Then your parser wants:

top:        paragraphs
          | paragraphs SEP    // to allow for trailing blank lines
          ;
paragraphs: paragraph
          | paragraphs SEP paragraph
          ;
paragraph:  lines
          ;
lines:      LINE
          | lines LINE
          ;

I would think that would work.  
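
As a sanity check on what the grammar accepts, the sketch below walks an array of tokens by hand and counts paragraphs. It is a hand-written stand-in for the Bison-generated parser, not its output, and the names (count_paragraphs, TOK_LINE, TOK_SEP, TOK_EOF) are hypothetical. It assumes the token array is terminated by TOK_EOF.

```c
#include <assert.h>
#include <stddef.h>

enum token { TOK_EOF = 0, TOK_LINE, TOK_SEP };

/* Return the number of paragraphs in a TOK_EOF-terminated token array,
 * or -1 if the sequence does not fit the grammar. */
int count_paragraphs(const enum token *toks)
{
    size_t i = 0;
    int paragraphs = 0;

    assert(toks != NULL);
    for (;;) {
        if (toks[i] != TOK_LINE)      /* paragraph: needs at least one LINE */
            return -1;
        while (toks[i] == TOK_LINE)   /* lines: LINE | lines LINE */
            i++;
        paragraphs++;
        if (toks[i] == TOK_EOF)       /* top: paragraphs */
            return paragraphs;
        if (toks[i] != TOK_SEP)
            return -1;
        i++;                          /* consume the SEP */
        if (toks[i] == TOK_EOF)       /* top: paragraphs SEP */
            return paragraphs;
    }
}
```

The sequence LINE SEP LINE LINE SEP LINE, which corresponds to the a / b, x / c input from the original post, comes out as three paragraphs. A leading SEP is rejected, since the grammar as written makes no allowance for blank lines at the top of the file.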

--jkl






Re: Syntax error if paragraph contains more than 1 printable character

2023-12-12 Thread Steve Litt
Thanks Chris,

Changing "." to any of ".*", ".+", "[^\n]*" or "[^\n]+" solves the
problem with multiple printables on one line, but still throws a syntax
error when two lines follow each other without an intervening
blank line. In other words, it calls a paragraph of more than one line
a syntax error.

I've already split paratext into multiple LINE tokens which represent a
line without its NL, and now I'm thinking of splitting line into
multiple chars ("[^\n]"). Perhaps this will make the rules less
complicated, though longer.

Thanks,

SteveT

Steve Litt 

 


Chris verBurg said on Tue, 12 Dec 2023 18:57:08 -0800

>Hey Steve,
>
>My reading of your code is that PARATEXT will only ever be a single
>character.  I'm thinking you want the flex rule to be ".*" (etc)
>instead of just ".".
>
>I'm curious whether your paragraphs are allowed to contain NLs.  If so,
>you're going to have to include them in the PARATEXT token value, and
>also update it to not match more than one.  I don't know offhand if
>there's some cleverness with trailing context that could be used there.
>
>-Chris
>
>
>On Tue, Dec 12, 2023 at 9:29 AM Steve Litt wrote:
>
>> Hi all,
>>
>> I'm creating a parser that takes a text file whose paragraphs are
>> separated by blank lines. Unfortunately, if the input file contains a
>> paragraph with more than one non-space character, it gives me a
>> syntax error via yyerror(). So the following works, where ===== etc.
>> are not in the input file but just signify beginning and end of
>> file:
>>
>> 
>> a
>>
>> b
>>
>> c
>>
>> 
>>
>> The following throws a syntax error:
>>
>> 
>> a
>>
>> bx
>>
>> c
>>
>> 
>>
>> The following also throws a syntax error:
>>
>> 
>> a
>>
>> b
>> x
>>
>> c
>>
>> 
>>
>> I'd appreciate any guidance as to what is wrong with my Flex and
>> Bison programs (I suspect my rules in the rules section of Bison).
>> My Flex and Bison programs follow, once again delineated by lines of
>> equal signs that don't exist in the program:
>>
>> === Flex Program =
>> %option noinput nounput
>> %{
>> #include "paragraphs.tab.h"
>> %}
>>
>> %%
>>
>> [ \t]*\n   {strcpy (yylval.y_char, yytext); return NL;  }
>> .          {strcpy (yylval.y_char, yytext); return PARATEXT; }
>>
>> %%
>>
>>
>> int yywrap(void)
>> {
>>  return 1;
>> }
>>
>> int yyerror(char *errormsg)
>> {
>>   fprintf(stderr, "%s\n", errormsg);
>>   exit(1);
>> }
>> 
>>
>>
>> === Bison Program =
>> %{
>>
>> #include 
>> #include 
>> int yylex(void);
>> int yyerror (char *errmsg);
>> #define EOF_ 0
>> %}
>>
>> %union {
>> char y_char [1];
>> }
>> %token <y_char> PARATEXT
>> %token <y_char> NL
>> %%
>>
>> wholefile : wholefile2 {printf("End of file.\n");};
>>
>> wholefile2 : toptrash {printf("Beginning of file.\n");}
>>   multichunk {printf("dia multichunk\n");}
>> ;
>>
>> toptrash : %empty {printf("dia empty multitrash\n");}
>>   | toptrash {printf("dia another toptrash\n");}
>>   NL {printf("dia another NL in toptrash\n");}
>> ;
>>
>> multichunk : %empty {printf("dia empty multichunk\n");}
>>   | multichunk chunk {printf("dia multichunk chunk\n");}
>> ;
>>
>> chunk : PARATEXT { printf("%s %s\n", "PARATEXT", $1); }
>>   parend {printf("dia parend\n");}
>> ;
>>
>> parend :
>>   multinewline {printf("dia multinewline\n");}
>>   | YYEOF {printf("dia YYEOF\n");}
>>
>> multinewline : NL  { printf("%s%s", "NL_tok1",$1); }
>>   NL   { printf("%s%s", "NL_tok2",$1); }
>>   | multinewline NL   { printf("%s\n", "dia multinewline NL"); }
>> ;
>>
>> %%
>>
>> int main(int argc, char *argv[]){
>> printf("\nStarting...\n");
>> printf("dia value of YYEOF is %d\n", YYEOF);
>> yyparse();
>> printf("\nFinished...\n");
>> }
>>
>> 
>>
>> Thanks,
>>
>> SteveT
>>
>> Steve Litt
>>
>> Autumn 2023 featured book: Rapid Learning for the 21st Century
>> http://www.troubleshooters.com/rl21
>>
>>  



Re: Syntax error if paragraph contains more than 1 printable character

2023-12-12 Thread Chris verBurg
Hey Steve,

My reading of your code is that PARATEXT will only ever be a single
character.  I'm thinking you want the flex rule to be ".*" (etc) instead of
just ".".

I'm curious whether your paragraphs are allowed to contain NLs.  If so,
you're going to have to include them in the PARATEXT token value, and also
update it to not match more than one.  I don't know offhand if there's some
cleverness with trailing context that could be used there.

-Chris


On Tue, Dec 12, 2023 at 9:29 AM Steve Litt wrote:

> Hi all,
>
> I'm creating a parser that takes a text file whose paragraphs are
> separated by blank lines. Unfortunately, if the input file contains a
> paragraph with more than one non-space character, it gives me a syntax
> error via yyerror(). So the following works, where ===== etc. are not in
> the input file but just signify beginning and end of file:
>
> 
> a
>
> b
>
> c
>
> 
>
> The following throws a syntax error:
>
> 
> a
>
> bx
>
> c
>
> 
>
> The following also throws a syntax error:
>
> 
> a
>
> b
> x
>
> c
>
> 
>
> I'd appreciate any guidance as to what is wrong with my Flex and Bison
> programs (I suspect my rules in the rules section of Bison). My Flex
> and Bison programs follow, once again delineated by lines of equal
> signs that don't exist in the program:
>
> === Flex Program =
> %option noinput nounput
> %{
> #include "paragraphs.tab.h"
> %}
>
> %%
>
> [ \t]*\n   {strcpy (yylval.y_char, yytext); return NL;  }
> .          {strcpy (yylval.y_char, yytext); return PARATEXT; }
>
> %%
>
>
> int yywrap(void)
> {
>  return 1;
> }
>
> int yyerror(char *errormsg)
> {
>   fprintf(stderr, "%s\n", errormsg);
>   exit(1);
> }
> 
>
>
> === Bison Program =
> %{
>
> #include 
> #include 
> int yylex(void);
> int yyerror (char *errmsg);
> #define EOF_ 0
> %}
>
> %union {
> char y_char [1];
> }
> %token <y_char> PARATEXT
> %token <y_char> NL
> %%
>
> wholefile : wholefile2 {printf("End of file.\n");};
>
> wholefile2 : toptrash {printf("Beginning of file.\n");}
>   multichunk {printf("dia multichunk\n");}
> ;
>
> toptrash : %empty {printf("dia empty multitrash\n");}
>   | toptrash {printf("dia another toptrash\n");}
>   NL {printf("dia another NL in toptrash\n");}
> ;
>
> multichunk : %empty {printf("dia empty multichunk\n");}
>   | multichunk chunk {printf("dia multichunk chunk\n");}
> ;
>
> chunk : PARATEXT { printf("%s %s\n", "PARATEXT", $1); }
>   parend {printf("dia parend\n");}
> ;
>
> parend :
>   multinewline {printf("dia multinewline\n");}
>   | YYEOF {printf("dia YYEOF\n");}
>
> multinewline : NL  { printf("%s%s", "NL_tok1",$1); }
>   NL   { printf("%s%s", "NL_tok2",$1); }
>   | multinewline NL   { printf("%s\n", "dia multinewline NL"); }
> ;
>
> %%
>
> int main(int argc, char *argv[]){
> printf("\nStarting...\n");
> printf("dia value of YYEOF is %d\n", YYEOF);
> yyparse();
> printf("\nFinished...\n");
> }
>
> 
>
> Thanks,
>
> SteveT
>
> Steve Litt
>
> Autumn 2023 featured book: Rapid Learning for the 21st Century
> http://www.troubleshooters.com/rl21
>
>