Re: readchars, seek back, and readchars again

Joseph Brenner Mon, 27 Apr 2020 00:49:47 -0700

After you do a .readchars, what point in the file would you expect to
be "current"?  I would expect it to be the point right after the
last char read.  In practice that's true if you're reading ASCII
characters, but not for Unicode characters above the ASCII range: in
that case the "current" point is further along (in the cases I've
looked at, by 3 bytes).
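
Here's a minimal sketch of the comparison I have in mind, assuming the
file is plain NFC-normalized UTF-8 so that re-encoding the returned
string gives back the bytes that were on disk.  The scratch file name
is arbitrary:

```raku
use v6;

my $tmpdir = IO::Spec::Unix.tmpdir;
my $file   = "$tmpdir/scratch_file.txt";

for "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]", "ABCD" -> $str {
    spurt $file, $str;
    my $fh   = $file.IO.open;
    my $read = $fh.readchars(3);

    # bytes occupied on disk by the chars readchars actually returned
    my $expected = $read.encode('UTF-8').bytes;

    printf "expected %d, .tell reports %d\n", $expected, $fh.tell;
    $fh.close;
}
```

For the wide string the calculation gives 9 (three 3-byte chars); in
the run quoted further down, .tell reported 12 at that point.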


If you try to intermix readchars with calls to .seek using the
"SeekFromCurrent" feature, it can be tricky to predict where you're
going to end up, because the point you're starting at depends on what
kind of text you've been reading, not just the number of bytes you've
read.
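
One way to sidestep this might be to keep my own byte offset and seek
from the beginning of the file instead of from the current point.  This
is only a sketch, not a statement about how readchars is meant to be
used: $offset and $third-at are my own bookkeeping variables, and it
assumes the file is NFC-normalized UTF-8 with no newline translation,
so that the encoded width of each string matches its width on disk.

```raku
use v6;

my $tmpdir = IO::Spec::Unix.tmpdir;
my $file   = "$tmpdir/scratch_file.txt";
spurt $file, "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";

my $fh     = $file.IO.open;
my $offset = 0;     # our own byte offset, kept independently of .tell

my $skipped = $fh.readchars(2);
$offset += $skipped.encode('UTF-8').bytes;

my $third-at = $offset;             # byte offset where the 3rd char starts
my $chr_1 = $fh.readchars(1);
$offset += $chr_1.encode('UTF-8').bytes;

# absolute seek: does not depend on where the handle thinks "current" is
$fh.seek: $third-at, SeekFromBeginning;
my $chr_2 = $fh.readchars(1);
$fh.close;

say $chr_1 eq $chr_2 ?? "re-read the same char" !! "got a different char";
```

The absolute seek needs no $nudge because it never consults the
handle's own idea of the current position.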

Is that making any sense?  I posted a later code example that might
show the problem more clearly...

On 4/26/20, Samantha McVey <samant...@posteo.net> wrote:
> On Saturday 25 April 2020 21:51:41 CEST Joseph Brenner wrote:
>> > Yary has an issue posted regarding 'display-width' of UTF-16 encoded
>> > strings:
>> >  https://github.com/rakudo/rakudo/issues/3461
>> >
>> >  I know it might be far-fetched, but what if your UTF-8 issue and
>> >  Yary's UTF-16 issue were related
>>
>> Well, an issue with handling combining characters could easily affect
>> both; nothing about it is specific to one encoding. Yary's issue
>> doesn't have to do with reading from disk though, he's just looking at
>> the raw bytes the encoding generates.
>>
>> On 4/24/20, William Michels <w...@caa.columbia.edu> wrote:
>> > Hi Joe,
>> >
>> > I was able to run the code you posted and reproduced the exact same
>> > result (Rakudo version 2020.02.1.0000.1 built on MoarVM version
>> > 2020.02.1 implementing Raku 6.d). I tried playing with file encodings
>> > a bit (e.g. UTF8-C8), but I didn't see any improvement.
>> >
>> > Yary has an issue posted regarding 'display-width' of UTF-16 encoded
>> > strings:
>> >
>> > https://github.com/rakudo/rakudo/issues/3461
>> >
>> > I know it might be far-fetched, but what if your UTF-8 issue and
>> > Yary's UTF-16 issue were related? It would be nice to kill two birds
>> > with one stone.
>> >
>> > Best Regards, Bill.
>> >
>> > On Fri, Apr 24, 2020 at 1:20 PM Joseph Brenner <doom...@gmail.com>
>> > wrote:
>> >> Another version of my test code, checking .tell throughout:
>> >>
>> >> use v6;
>> >> use Test;
>> >>
>> >> my $tmpdir = IO::Spec::Unix.tmpdir;
>> >> my $file = "$tmpdir/scratch_file.txt";
>> >> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";
>> >> # ሀⶀ䷼ꪪⲤⲎ
>> >> my $ascii_str =   "ABCDEFGHI";
>> >>
>> >> test_read_and_read_again($unichar_str, $file, 3);
>> >> test_read_and_read_again($ascii_str,   $file, 0);
>> >>
>> >> # write given string to file, then read the third character twice and
>> >> # check
>> >> sub test_read_and_read_again($str, $file, $nudge = 0) {
>> >>
>> >>     spurt $file, $str;
>> >>     my $fh = $file.IO.open;
>> >>     printf "%d: just opened\n", $fh.tell;
>> >>     $fh.readchars(2);  # skip a few
>> >>     printf "%d: after skipping 2\n", $fh.tell;
>> >>     my $chr_1 =      $fh.readchars(1);
>> >>     printf "%d: after reading 3rd: %s\n", $fh.tell, $chr_1;
>> >>     # for our purposes, always 1 or 3
>> >>     my $width = $chr_1.encode('UTF-8').bytes;
>> >>
>> >>     my $step_back = $width + $nudge;
>> >>     $fh.seek: -$step_back, SeekFromCurrent;
>> >>     printf "%d: after seeking back %d\n", $fh.tell, $step_back;
>> >>     my $chr_2 =      $fh.readchars(1);
>> >>     printf "%d: after re-reading 3rd: %s\n", $fh.tell, $chr_2;
>> >>     is( $chr_1, $chr_2,
>> >>         "read, seek back, and read again gets same char with nudge of $nudge" );
>> >> }
>> >>
>> >>
>> >> The output looks like so:
>> >>
>> >> /home/doom/End/Cave/Perl6/bin/trial-seeking_inner_truth.pl6
>> >> 0: just opened
>> >> 9: after skipping 2
>> >> 12: after reading 3rd: ䷼
>> >> 6: after seeking back 6
>> >> 12: after re-reading 3rd: ䷼
>> >> ok 1 - read, seek back, and read again gets same char with nudge of 3
>> >> 0: just opened
>> >> 2: after skipping 2
>> >> 3: after reading 3rd: C
>> >> 2: after seeking back 1
>> >> 3: after re-reading 3rd: C
>> >> ok 2 - read, seek back, and read again gets same char with nudge of 0
>> >>
>> >> It's really hard to see what I should do if I really wanted to
>> >> intermix readchars and seeks like this... I'd need to check the range
>> >> of the codepoint to see how far I need to seek to get where I expect
>> >> to be.
>> >>
>> >> On 4/24/20, Joseph Brenner <doom...@gmail.com> wrote:
>> >> > Thanks, yes I understand unicode and utf-8 reasonably well.
>> >> >
>> >> >> So Rakudo has to read the next codepoint to make sure that it
>> >> >> isn't a combining codepoint.
>> >> >>
>> >> >> It is probably faking up the reads to look right when reading
>> >> >> ASCII, but failing to do that for wider codepoints.
>> >> >
>> >> > I think it'd be the other way around... the idea here would be that
>> >> > it's doing an extra readchar behind the scenes just in case there are
>> >> > combining chars involved, so you're figuring there's some confusion
>> >> > between the actual point in the file that's being read and the
>> >> > abstraction that readchars is supplying?
>> >> >
>> >> > On 4/24/20, Brad Gilbert <b2gi...@gmail.com> wrote:
>> >> >> In UTF-8, a codepoint can be 1 to 4 bytes long.
>> >> >>
>> >> >> UTF-8 was designed so that 7-bit ASCII is a subset of it.
>> >> >>
>> >> >> Any 8-bit byte that has its most significant bit set cannot be
>> >> >> ASCII. So multi-byte codepoints have the most significant bit set
>> >> >> for all of the bytes.
>> >> >> The first byte can tell you the number of bytes that follow it.
>> >> >>
>> >> >> That is how a single codepoint is stored.
>> >> >>
>> >> >> A character can be made of several codepoints.
>> >> >>
>> >> >>     "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
>> >> >>     "é"
>> >> >>
>> >> >> So Rakudo has to read the next codepoint to make sure that it
>> >> >> isn't a combining codepoint.
>> >> >>
>> >> >> It is probably faking up the reads to look right when reading
>> >> >> ASCII, but failing to do that for wider codepoints.
>> >> >>
>> >> >> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner <doom...@gmail.com>
>> >> >> wrote:
>> >> >>> I thought that doing a readchars on a filehandle, seeking
>> >> >>> backwards the width of the char in bytes and then doing another read
>> >> >>> would always get the same character.  That works for ascii-range
>> >> >>> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
>> >> >>> characters (commonly 3-bytes in utf-8).
>> >> >>>
>> >> >>> The question then, is why do I need a $nudge of 3 for wide chars,
>> >> >>> but not ascii-range ones?
>> >> >>>
>> >> >>> use v6;
>> >> >>> use Test;
>> >> >>>
>> >> >>> my $tmpdir = IO::Spec::Unix.tmpdir;
>> >> >>> my $file = "$tmpdir/scratch_file.txt";
>> >> >>> my $unichar_str =
>> >> >>> "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";
>> >> >>> # ሀⶀ䷼ꪪⲤⲎ
>> >> >>> my $ascii_str =   "ABCDEFGHI";
>> >> >>>
>> >> >>> subtest {
>> >> >>>
>> >> >>>     my $nudge = 3;
>> >> >>>     test_read_and_read_again($unichar_str, $file, $nudge);
>> >> >>>
>> >> >>> }, "Wide unicode chars: $unichar_str";
>> >> >>>
>> >> >>> subtest {
>> >> >>>
>> >> >>>     my $nudge = 0;
>> >> >>>     test_read_and_read_again($ascii_str, $file, $nudge);
>> >> >>>
>> >> >>> }, "Ascii-range chars: $ascii_str";
>> >> >>>
>> >> >>> # write given string to file, then read the third character twice
>> >> >>> # and check
>> >> >>> sub test_read_and_read_again($str, $file, $nudge = 0) {
>> >> >>>
>> >> >>>     spurt $file, $str;
>> >> >>>     my $fh = $file.IO.open;
>> >> >>>     $fh.readchars(2);  # skip a few
>> >> >>>     my $chr_1 =      $fh.readchars(1);
>> >> >>>     # for our purposes, always 1 or 3
>> >> >>>     my $width = $chr_1.encode('UTF-8').bytes;
>> >> >>>
>> >> >>>     my $step_back = $width + $nudge;
>> >> >>>     $fh.seek: -$step_back, SeekFromCurrent;
>> >> >>>     my $chr_2 =      $fh.readchars(1);
>> >> >>>     is( $chr_1, $chr_2,
>> >> >>>         "read, seek back, and read again gets same char with nudge of $nudge" );
>> >> >>> }
>
> I don't think the UTF-16 issue is related. On the topic of readchars: can
> someone tell me why that readchars script result is unexpected? Maybe I
> missed part of the conversation, but if someone can summarize the expected
> vs. actual result that would be great.
>
>
>
>
