Re: readchars, seek back, and readchars again

Samantha McVey Tue, 28 Apr 2020 10:33:23 -0700

On Monday 27 April 2020 09:49:20 CEST Joseph Brenner wrote:
> After you do a .readchars, what point in the file would you expect to
> be "current"?  I would expect it would be the point right after the
> last char read.  Instead that's true if you're reading ascii
> characters but not unicode characters up above the ascii range, in
> which case the "current" point is larger than that (in the cases I've
> looked at, larger by 3 bytes).
> 
> If you try to intermix readchars with calls to .seek using the
> "SeekFromCurrent" feature, it can be tricky to predict where you're
> going to end up, because the point you're starting at depends on what
> kind of text you've been reading, not just the number of bytes you've
> read.
> 
> Is that making any sense?  I posted a later code example that might
> show the problem more clearly...
> 
> On 4/26/20, Samantha McVey <[email protected]> wrote:
> > On Saturday 25 April 2020 21:51:41 CEST Joseph Brenner wrote:
> >> > Yary has an issue posted regarding 'display-width' of UTF-16 encoded
> >> > strings:
> >> >  https://github.com/rakudo/rakudo/issues/3461
> >> >  
> >> >  I know it might be far-fetched, but what if your UTF-8 issue and
> >> >  Yary's UTF-16 issue were related
> >> 
> >> Well, an issue with handling combining characters could easily affect
> >> both, nothing about it is specific to one encoding. Yary's issue
> >> doesn't have to do with reading from disk though, he's just looking at
> >> the raw bytes the encoding generates.
> >> 
> >> On 4/24/20, William Michels <[email protected]> wrote:
> >> > Hi Joe,
> >> > 
> >> > I was able to run the code you posted and reproduced the exact same
> >> > result (Rakudo version 2020.02.1.0000.1 built on MoarVM version
> >> > 2020.02.1 implementing Raku 6.d). I tried playing with file encodings a bit
> >> > (e.g. UTF8-C8), but I didn't see any improvement.
> >> > 
> >> > Yary has an issue posted regarding 'display-width' of UTF-16 encoded
> >> > strings:
> >> > 
> >> > https://github.com/rakudo/rakudo/issues/3461
> >> > 
> >> > I know it might be far-fetched, but what if your UTF-8 issue and
> >> > Yary's UTF-16 issue were related? It would be nice to kill two birds
> >> > with one stone.
> >> > 
> >> > Best Regards, Bill.
> >> > 
> >> > On Fri, Apr 24, 2020 at 1:20 PM Joseph Brenner <[email protected]> wrote:
> >> >> Another version of my test code, checking .tell throughout:
> >> >> 
> >> >> use v6;
> >> >> use Test;
> >> >> 
> >> >> my $tmpdir = IO::Spec::Unix.tmpdir;
> >> >> my $file = "$tmpdir/scratch_file.txt";
> >> >> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";
> >> >> # ሀⶀ䷼ꪪⲤⲎ
> >> >> my $ascii_str =   "ABCDEFGHI";
> >> >> 
> >> >> test_read_and_read_again($unichar_str, $file, 3);
> >> >> test_read_and_read_again($ascii_str,   $file, 0);
> >> >> 
> >> >> # write given string to file, then read the third character twice and check
> >> >> sub test_read_and_read_again($str, $file, $nudge = 0) {
> >> >> 
> >> >>     spurt $file, $str;
> >> >>     my $fh = $file.IO.open;
> >> >>     printf "%d: just opened\n", $fh.tell;
> >> >>     $fh.readchars(2);  # skip a few
> >> >>     printf "%d: after skipping 2\n", $fh.tell;
> >> >>     my $chr_1 =      $fh.readchars(1);
> >> >>     printf "%d: after reading 3rd: %s\n", $fh.tell, $chr_1;
> >> >>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
> >> >>     my $step_back = $width + $nudge;
> >> >>     $fh.seek: -$step_back, SeekFromCurrent;
> >> >>     printf "%d: after seeking back %d\n", $fh.tell, $step_back;
> >> >>     my $chr_2 =      $fh.readchars(1);
> >> >>     printf "%d: after re-reading 3rd: %s\n", $fh.tell, $chr_2;
> >> >>     is( $chr_1, $chr_2,
> >> >>         "read, seek back, and read again gets same char with nudge of $nudge" );
> >> >> }
> >> >> 
> >> >> 
> >> >> The output looks like so:
> >> >> 
> >> >> /home/doom/End/Cave/Perl6/bin/trial-seeking_inner_truth.pl6
> >> >> 0: just opened
> >> >> 9: after skipping 2
> >> >> 12: after reading 3rd: ䷼
> >> >> 6: after seeking back 6
> >> >> 12: after re-reading 3rd: ䷼
> >> >> ok 1 - read, seek back, and read again gets same char with nudge of 3
> >> >> 0: just opened
> >> >> 2: after skipping 2
> >> >> 3: after reading 3rd: C
> >> >> 2: after seeking back 1
> >> >> 3: after re-reading 3rd: C
> >> >> ok 2 - read, seek back, and read again gets same char with nudge of 0
> >> >> 
> >> >> It's really hard to see what I should do if I really wanted to
> >> >> intermix readchars and seeks like this... I'd need to check the range
> >> >> of the codepoint to see how far I need to seek to get where I expect
> >> >> to be.
> >> >> 
> >> >> On 4/24/20, Joseph Brenner <[email protected]> wrote:
> >> >> > Thanks, yes I understand unicode and utf-8 reasonably well.
> >> >> > 
> >> >> >> So Rakudo has to read the next codepoint to make sure that it isn't a
> >> >> >> combining codepoint.
> >> >> >> 
> >> >> >> It is probably faking up the reads to look right when reading ASCII,
> >> >> >> but failing to do that for wider codepoints.
> >> >> > 
> >> >> > I think it'd be the other way around... the idea here would be it's
> >> >> > doing an extra readchar behind the scenes just in case there's
> >> >> > combining chars involved-- so you're figuring there's some confusion
> >> >> > about the actual point in the file that's being read and the
> >> >> > abstraction that readchars is supplying?
> >> >> > 
> >> >> > On 4/24/20, Brad Gilbert <[email protected]> wrote:
> >> >> >> In UTF8 characters can be 1 to 4 bytes long.
> >> >> >> 
> >> >> >> UTF8 was designed so that 7-bit ASCII is a subset of it.
> >> >> >> 
> >> >> >> Any 8-bit byte that has its most significant bit set cannot be ASCII.
> >> >> >> So multi-byte codepoints have the most significant bit set for all of
> >> >> >> the bytes.
> >> >> >> The first byte can tell you the number of bytes that follow it.
> >> >> >> 
> >> >> >> That is how a single codepoint is stored.
> >> >> >> 
> >> >> >> A character can be made of several codepoints.
> >> >> >> 
> >> >> >>     "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
> >> >> >>     "é"
> >> >> >> 
> >> >> >> So Rakudo has to read the next codepoint to make sure that it isn't a
> >> >> >> combining codepoint.
> >> >> >> 
> >> >> >> It is probably faking up the reads to look right when reading ASCII,
> >> >> >> but failing to do that for wider codepoints.
> >> >> >> 
> >> >> >> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner <[email protected]> wrote:
> >> >> >>> I thought that doing a readchars on a filehandle, seeking backwards
> >> >> >>> the width of the char in bytes and then doing another read
> >> >> >>> would always get the same character.  That works for ascii-range
> >> >> >>> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
> >> >> >>> characters (commonly 3-bytes in utf-8).
> >> >> >>> 
> >> >> >>> The question then, is why do I need a $nudge of 3 for wide chars,
> >> >> >>> but not ascii-range ones?
> >> >> >>> 
> >> >> >>> use v6;
> >> >> >>> use Test;
> >> >> >>> 
> >> >> >>> my $tmpdir = IO::Spec::Unix.tmpdir;
> >> >> >>> my $file = "$tmpdir/scratch_file.txt";
> >> >> >>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";
> >> >> >>> # ሀⶀ䷼ꪪⲤⲎ
> >> >> >>> my $ascii_str =   "ABCDEFGHI";
> >> >> >>> 
> >> >> >>> subtest {
> >> >> >>> 
> >> >> >>>     my $nudge = 3;
> >> >> >>>     test_read_and_read_again($unichar_str, $file, $nudge);
> >> >> >>> 
> >> >> >>> }, "Wide unicode chars: $unichar_str";
> >> >> >>> 
> >> >> >>> subtest {
> >> >> >>> 
> >> >> >>>     my $nudge = 0;
> >> >> >>>     test_read_and_read_again($ascii_str, $file, $nudge);
> >> >> >>> 
> >> >> >>> }, "Ascii-range chars: $ascii_str";
> >> >> >>> 
> >> >> >>> # write given string to file, then read the third character twice and check
> >> >> >>> sub test_read_and_read_again($str, $file, $nudge = 0) {
> >> >> >>> 
> >> >> >>>     spurt $file, $str;
> >> >> >>>     my $fh = $file.IO.open;
> >> >> >>>     $fh.readchars(2);  # skip a few
> >> >> >>>     my $chr_1 =      $fh.readchars(1);
> >> >> >>>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
> >> >> >>>     my $step_back = $width + $nudge;
> >> >> >>>     $fh.seek: -$step_back, SeekFromCurrent;
> >> >> >>>     my $chr_2 =      $fh.readchars(1);
> >> >> >>>     is( $chr_1, $chr_2,
> >> >> >>>         "read, seek back, and read again gets same char with nudge of $nudge" );
> >> >> >>> }
> > 
> > I don't think the utf-16 issue is related. On the topic of readchars: can
> > someone tell me why that readchars script result is unexpected? Maybe I
> > missed part of the conversation, but if someone can summarize expected vs
> > actual result that would be great.


On my initial look it seems like readchars is broken. If I write one
character to a file and read it back:

spurt "test", "\c[woman]";
my $fh = open "test", :read;
$fh.tell; # 0, all good so far
$fh.readchars(1); # this returns \c[woman]
$fh.tell; # returns 4

So this works. But if I do:

spurt "test", "\c[woman]\c[man]";
my $fh = open "test", :read;
$fh.readchars(1); # this returns \c[woman]
$fh.tell; # this returns... 8...

So it looks like readchars is reading the whole file (at least for these very 
small files). If I do "\c[woman]\c[man]\c[cat]" and do readchars(1), .tell 
returns 12.
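
Those offsets are at least consistent with the file sizes. A minimal sketch
(utf8-width is just a name I made up for illustration, not anything in
Rakudo) that prints how many bytes each test string takes up once encoded
as UTF-8, to compare against the .tell values above:

```raku
# Hypothetical helper: number of bytes a string occupies once encoded as UTF-8.
sub utf8-width(Str $s) { $s.encode('utf8').bytes }

# The same three strings as in the tests above (uppercase Unicode names).
for "\c[WOMAN]", "\c[WOMAN]\c[MAN]", "\c[WOMAN]\c[MAN]\c[CAT]" -> $s {
    printf "%d chars, %d bytes\n", $s.chars, utf8-width($s);
}
```

If the byte counts match what .tell reported, then .tell is reporting how
far the underlying file has been read, not where the last returned
character ended.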

These are the results for me on Rakudo version 2020.02.1-254-g87d2ff953 
built on MoarVM version 2020.02.1-66-g4f08d803f
implementing Raku 6.d.

I am going to compile the development build, and then maybe see if I can look 
into Rakudo source.
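
In the meantime, a possible workaround sketch, not a fix: keep your own byte
offset instead of trusting .tell after readchars, and seek with
SeekFromBeginning rather than SeekFromCurrent. The read-tracked sub and the
scratch file name are made up for illustration. This assumes .seek throws
away whatever the decoder has buffered, which the output earlier in the
thread suggests but which I have not confirmed in the source.

```raku
my $file = $*TMPDIR.add('scratch_file.txt');    # hypothetical scratch file
$file.spurt: "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]";

my $fh  = $file.open;
my $pos = 0;                                    # our own byte offset, not .tell

# Hypothetical helper: read $n chars and advance $pos by their UTF-8 width.
sub read-tracked(Int $n) {
    my $s = $fh.readchars($n);
    $pos += $s.encode('utf8').bytes;
    $s
}

read-tracked(2);                                # skip two chars
my $before = $pos;                              # byte offset where the 3rd char starts
my $first  = read-tracked(1);

$fh.seek: $before, SeekFromBeginning;           # absolute seek, independent of .tell
$pos = $before;
my $again = read-tracked(1);

say $first eq $again;
$fh.close;
```

One caveat: Raku normalizes strings to NFC on input, so re-encoding what was
read only gives the right byte count if the file's contents were already in
NFC.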
