Unicode characters counted as multiple characters

Ion Toloaca Fri, 23 May 2014 08:18:44 -0700

Hello everyone,

      I have been trying for some time to get the start and end positions
of the last matched rule, and I ran into trouble when I tried an example
that contained Unicode. Below is a simplified version showing that the
position (the return value of the read() method) is counted wrongly because
of the Unicode character; the code works fine if the character is replaced
with a non-Unicode one, for example '='. (The start position and the length
are, by the way, given in tokens, but I work around this by using
my $last_expression = $recce->substring($start_rule, $length_rule);
and taking its length.)
     Is Marpa at fault here for not counting Unicode correctly, or did I
just use "encode", "decode" or something else in the wrong way?


#################################################################################################
use utf8;
use 5.014;
use strict;
use warnings;
use Data::Dumper; 
use Marpa::R2;
use Encode;

my $dsl = encode("UTF-8",<<END_OF_DSL);
:start ::= Start
:default ::= action => do_print
Start ::= Rule1 
Rule1 ::= '≠'
event 'Start' = completed Start 
END_OF_DSL


#Initialize grammar#
my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce = Marpa::R2::Scanless::R->new(
    { grammar => $grammar, semantics_package => 'My_Actions' } );


my $input = encode("UTF-8",'≠');
my $pos = $recce->read( \$input);
my ($start_rule, $length_rule) = $recce->last_completed("Start");
print("$pos"); # $pos == 3 since the unicode symbol is counted as 3 symbols
               # (for usual symbols, $pos == 1 as expected)
###############################################################################################
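
For comparison, here is a minimal sketch that uses only core Perl and the
Encode module, with no Marpa involved. It shows that the same '≠' has
length 1 as a character string and length 3 once it has been passed through
encode("UTF-8", ...), which matches the 3 reported above. Whether read()
counts positions in the units of the string it is handed is my assumption
here, not something this snippet proves. The variable names are made up for
the illustration.

```perl
use utf8;                 # the source file itself is UTF-8
use strict;
use warnings;
use Encode qw(encode);

my $chars = '≠';                        # character string: one code point (U+2260)
my $bytes = encode( 'UTF-8', $chars );  # byte string: the UTF-8 encoding of it

my $char_count = length $chars;         # counts characters
my $byte_count = length $bytes;         # counts bytes after encoding

print "characters: $char_count\n";      # characters: 1
print "bytes:      $byte_count\n";      # bytes:      3
```

If that assumption holds, the encoded $input in my example contains three
one-byte "characters" as far as Perl is concerned, which would explain
$pos == 3.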

    Thank you in advance for any help with this issue.

Best regards, 
Toloaca Ion
