Another version of my test code, checking .tell throughout:

use v6;
use Test;

my $tmpdir = IO::Spec::Unix.tmpdir;
my $file = "$tmpdir/scratch_file.txt";
my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
my $ascii_str =   "ABCDEFGHI";

test_read_and_read_again($unichar_str, $file, 3);
test_read_and_read_again($ascii_str,   $file, 0);

# write given string to file, then read the third character twice and check
sub test_read_and_read_again($str, $file, $nudge = 0) {
    spurt $file, $str;
    my $fh = $file.IO.open;
    printf "%d: just opened\n", $fh.tell;
    $fh.readchars(2);  # skip a few
    printf "%d: after skipping 2\n", $fh.tell;
    my $chr_1 =      $fh.readchars(1);
    printf "%d: after reading 3rd: %s\n", $fh.tell, $chr_1;
    my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
    my $step_back = $width + $nudge;
    $fh.seek: -$step_back, SeekFromCurrent;
    printf "%d: after seeking back %d\n", $fh.tell, $step_back;
    my $chr_2 =      $fh.readchars(1);
    printf "%d: after re-reading 3rd: %s\n", $fh.tell, $chr_2;
    is( $chr_1, $chr_2,
        "read, seek back, and read again gets same char with nudge of $nudge" );
}


The output looks like so:

/home/doom/End/Cave/Perl6/bin/trial-seeking_inner_truth.pl6
0: just opened
9: after skipping 2
12: after reading 3rd: ䷼
6: after seeking back 6
12: after re-reading 3rd: ䷼
ok 1 - read, seek back, and read again gets same char with nudge of 3
0: just opened
2: after skipping 2
3: after reading 3rd: C
2: after seeking back 1
3: after re-reading 3rd: C
ok 2 - read, seek back, and read again gets same char with nudge of 0
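
To relate those .tell values to the file layout, here is a small sketch that lists where each character of the wide string starts, using only .comb and .encode. All six characters encode to 3 bytes in UTF-8, so the third one occupies bytes 6 through 8. The 9 reported after skipping two characters is therefore already one character past where I'd expect the handle to be.

```raku
use v6;

my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";

# Walk the string one character at a time, tracking the byte offset
# at which each character would start in a UTF-8 encoded file.
my $offset = 0;
for $unichar_str.comb -> $chr {
    my $width = $chr.encode('UTF-8').bytes;
    printf "%s: starts at byte %d, width %d\n", $chr, $offset, $width;
    $offset += $width;
}
printf "total: %d bytes\n", $offset;
```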

It's hard to see what I should do if I really wanted to intermix
readchars and seeks like this. I'd need to check the range of the
codepoint to work out how far to seek to get where I expect to be.
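
One way to sidestep the offset guessing, sketched here under assumptions: open the handle in binary mode, where seek and tell deal in plain bytes, and decode each character by hand using the width announced by its UTF-8 lead byte. The helpers utf8-width and read-codepoint are hypothetical names of my own, not part of Rakudo. This reads one codepoint at a time, so a base letter followed by a combining mark would come back as two separate strings.

```raku
use v6;

# Hypothetical helper: length in bytes of a UTF-8 sequence, judged from its lead byte.
sub utf8-width(Int $lead --> Int) {
    return 1 if ($lead +& 0x80) == 0x00;   # 0xxxxxxx: ASCII
    return 2 if ($lead +& 0xE0) == 0xC0;   # 110xxxxx
    return 3 if ($lead +& 0xF0) == 0xE0;   # 1110xxxx
    return 4 if ($lead +& 0xF8) == 0xF0;   # 11110xxx
    die "not a UTF-8 lead byte: $lead";
}

# Hypothetical helper: read one codepoint from a handle opened with :bin.
sub read-codepoint(IO::Handle $fh --> Str) {
    my $buf = $fh.read(1);
    return '' unless $buf.elems;           # end of file
    my $width = utf8-width($buf[0]);
    $buf = $buf ~ $fh.read($width - 1) if $width > 1;
    return $buf.decode('utf8');
}

my $tmpdir = IO::Spec::Unix.tmpdir;
my $file = "$tmpdir/scratch_file.txt";
spurt $file, "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";

my $fh = $file.IO.open(:bin);
read-codepoint($fh) for 1..2;              # skip the first two characters
my $start = $fh.tell;                      # byte offset of the third character
my $chr_1 = read-codepoint($fh);
$fh.seek: $start, SeekFromBeginning;       # absolute seek, no width arithmetic
my $chr_2 = read-codepoint($fh);
$fh.close;

say "same character both times" if $chr_1 eq $chr_2;
```

Since nothing is decoded behind the scenes in binary mode, the position saved from .tell before the read should be exactly the position to seek back to, with no nudge needed.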



On 4/24/20, Joseph Brenner <doom...@gmail.com> wrote:
> Thanks, yes I understand unicode and utf-8 reasonably well.
>
>> So Rakudo has to read the next codepoint to make sure that it isn't a
>> combining codepoint.
>
>> It is probably faking up the reads to look right when reading ASCII, but
>> failing to do that for wider codepoints.
>
> I think it'd be the other way around... the idea here would be it's
> doing an extra readchar behind the scenes just in case there's
> combining chars involved-- so you're figuring there's some confusion
> about the actual point in the file that's being read and the
> abstraction that readchars is supplying?
>
>
> On 4/24/20, Brad Gilbert <b2gi...@gmail.com> wrote:
>> In UTF8 characters can be 1 to 4 bytes long.
>>
>> UTF8 was designed so that 7-bit ASCII is a subset of it.
>>
>> Any 8bit byte that has its most significant bit set cannot be ASCII.
>> So multi-byte codepoints have the most significant bit set for all of the
>> bytes.
>> The first byte can tell you the number of bytes that follow it.
>>
>> That is how a single codepoint is stored.
>>
>> A character can be made of several codepoints.
>>
>>     "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
>>     "é"
>>
>> So Rakudo has to read the next codepoint to make sure that it isn't a
>> combining codepoint.
>>
>> It is probably faking up the reads to look right when reading ASCII, but
>> failing to do that for wider codepoints.
>>
>> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner <doom...@gmail.com> wrote:
>>
>>> I thought that doing a readchars on a filehandle, seeking backwards
>>> the width of the char in bytes and then doing another read
>>> would always get the same character.  That works for ascii-range
>>> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
>>> characters (commonly 3-bytes in utf-8).
>>>
>>> The question then, is why do I need a $nudge of 3 for wide chars, but
>>> not ascii-range ones?
>>>
>>> use v6;
>>> use Test;
>>>
>>> my $tmpdir = IO::Spec::Unix.tmpdir;
>>> my $file = "$tmpdir/scratch_file.txt";
>>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
>>> my $ascii_str =   "ABCDEFGHI";
>>>
>>> subtest {
>>>     my $nudge = 3;
>>>     test_read_and_read_again($unichar_str, $file, $nudge);
>>> }, "Wide unicode chars: $unichar_str";
>>>
>>> subtest {
>>>     my $nudge = 0;
>>>     test_read_and_read_again($ascii_str, $file, $nudge);
>>> }, "Ascii-range chars: $ascii_str";
>>>
>>> # write given string to file, then read the third character twice and check
>>> sub test_read_and_read_again($str, $file, $nudge = 0) {
>>>     spurt $file, $str;
>>>     my $fh = $file.IO.open;
>>>     $fh.readchars(2);  # skip a few
>>>     my $chr_1 =      $fh.readchars(1);
>>>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
>>>     my $step_back = $width + $nudge;
>>>     $fh.seek: -$step_back, SeekFromCurrent;
>>>     my $chr_2 =      $fh.readchars(1);
>>>     is( $chr_1, $chr_2,
>>>         "read, seek back, and read again gets same char with nudge of $nudge" );
>>> }
>>>
>>
>
