On Monday, 27 April 2020 09:49:20 CEST Joseph Brenner wrote:
> After you do a .readchars, what point in the file would you expect to
> be "current"? I would expect it would be the point right after the
> last char read. Instead that's true if you're reading ascii
> characters but not unicode characters up above the ascii range, in
> which case the "current" point is larger than that (in the cases I've
> looked at, larger by 3 bytes).
>
> If you try to intermix readchars with calls to .seek using the
> "SeekFromCurrent" feature, it can be tricky to predict where you're
> going to end up, because the point you're starting at depends on what
> kind of text you've been reading, not just the number of bytes you've
> read.
>
> Is that making any sense? I posted a later code example that might
> show the problem more clearly...
>
> On 4/26/20, Samantha McVey <samant...@posteo.net> wrote:
> > On Saturday, 25 April 2020 21:51:41 CEST Joseph Brenner wrote:
> >> > Yary has an issue posted regarding 'display-width' of UTF-16 encoded
> >> > strings:
> >> > https://github.com/rakudo/rakudo/issues/3461
> >> >
> >> > I know it might be far-fetched, but what if your UTF-8 issue and
> >> > Yary's UTF-16 issue were related
> >>
> >> Well, an issue with handling combining characters could easily affect
> >> both, nothing about it is specific to one encoding. Yary's issue
> >> doesn't have to do with reading from disk though, he's just looking at
> >> the raw bytes the encoding generates.
> >>
> >> On 4/24/20, William Michels <w...@caa.columbia.edu> wrote:
> >> > Hi Joe,
> >> >
> >> > I was able to run the code you posted and reproduced the exact same
> >> > result (Rakudo version 2020.02.1.0000.1 built on MoarVM version
> >> > 2020.02.1 implementing Raku 6.d). I tried playing with file encodings
> >> > a bit (e.g. UTF8-C8), but I didn't see any improvement.
> >> >
> >> > Yary has an issue posted regarding 'display-width' of UTF-16 encoded
> >> > strings:
> >> >
> >> > https://github.com/rakudo/rakudo/issues/3461
> >> >
> >> > I know it might be far-fetched, but what if your UTF-8 issue and
> >> > Yary's UTF-16 issue were related? It would be nice to kill two birds
> >> > with one stone.
> >> >
> >> > Best Regards, Bill.
> >> >
> >> > On Fri, Apr 24, 2020 at 1:20 PM Joseph Brenner <doom...@gmail.com>
> >> > wrote:
> >> >> Another version of my test code, checking .tell throughout:
> >> >>
> >> >> use v6;
> >> >> use Test;
> >> >>
> >> >> my $tmpdir = IO::Spec::Unix.tmpdir;
> >> >> my $file = "$tmpdir/scratch_file.txt";
> >> >> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";
> >> >> # ሀⶀ䷼ꪪⲤⲎ
> >> >> my $ascii_str = "ABCDEFGHI";
> >> >>
> >> >> test_read_and_read_again($unichar_str, $file, 3);
> >> >> test_read_and_read_again($ascii_str, $file, 0);
> >> >>
> >> >> # write given string to file, then read the third character twice and check
> >> >> sub test_read_and_read_again($str, $file, $nudge = 0) {
> >> >>     spurt $file, $str;
> >> >>     my $fh = $file.IO.open;
> >> >>     printf "%d: just opened\n", $fh.tell;
> >> >>     $fh.readchars(2);  # skip a few
> >> >>     printf "%d: after skipping 2\n", $fh.tell;
> >> >>     my $chr_1 = $fh.readchars(1);
> >> >>     printf "%d: after reading 3rd: %s\n", $fh.tell, $chr_1;
> >> >>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
> >> >>     my $step_back = $width + $nudge;
> >> >>     $fh.seek: -$step_back, SeekFromCurrent;
> >> >>     printf "%d: after seeking back %d\n", $fh.tell, $step_back;
> >> >>     my $chr_2 = $fh.readchars(1);
> >> >>     printf "%d: after re-reading 3rd: %s\n", $fh.tell, $chr_2;
> >> >>     is( $chr_1, $chr_2,
> >> >>         "read, seek back, and read again gets same char with nudge of $nudge" );
> >> >> }
> >> >>
> >> >>
> >> >> The output looks like so:
> >> >>
> >> >> /home/doom/End/Cave/Perl6/bin/trial-seeking_inner_truth.pl6
> >> >> 0: just opened
> >> >> 9: after skipping 2
> >> >> 12: after reading 3rd: ䷼
> >> >> 6: after seeking back 6
> >> >> 12: after re-reading 3rd: ䷼
> >> >> ok 1 - read, seek back, and read again gets same char with nudge of 3
> >> >> 0: just opened
> >> >> 2: after skipping 2
> >> >> 3: after reading 3rd: C
> >> >> 2: after seeking back 1
> >> >> 3: after re-reading 3rd: C
> >> >> ok 2 - read, seek back, and read again gets same char with nudge of 0
> >> >>
> >> >> It's really hard to see what I should do if I really wanted to
> >> >> intermix readchars and seeks like this... I'd need to check the range
> >> >> of the codepoint to see how far I need to seek to get where I expect
> >> >> to be.
> >> >>
> >> >> On 4/24/20, Joseph Brenner <doom...@gmail.com> wrote:
> >> >> > Thanks, yes I understand unicode and utf-8 reasonably well.
> >> >> >
> >> >> >> So Rakudo has to read the next codepoint to make sure that it isn't
> >> >> >> a combining codepoint.
> >> >> >>
> >> >> >> It is probably faking up the reads to look right when reading
> >> >> >> ASCII, but failing to do that for wider codepoints.
> >> >> >
> >> >> > I think it'd be the other way around... the idea here would be it's
> >> >> > doing an extra readchar behind the scenes just in case there's
> >> >> > combining chars involved-- so you're figuring there's some confusion
> >> >> > about the actual point in the file that's being read and the
> >> >> > abstraction that readchars is supplying?
> >> >> >
> >> >> > On 4/24/20, Brad Gilbert <b2gi...@gmail.com> wrote:
> >> >> >> In UTF8 characters can be 1 to 4 bytes long.
> >> >> >>
> >> >> >> UTF8 was designed so that 7-bit ASCII is a subset of it.
> >> >> >>
> >> >> >> Any 8bit byte that has its most significant bit set cannot be
> >> >> >> ASCII.
> >> >> >> So multi-byte codepoints have the most significant bit set for all
> >> >> >> of the bytes.
> >> >> >> The first byte can tell you the number of bytes that follow it.
> >> >> >>
> >> >> >> That is how a single codepoint is stored.
> >> >> >>
> >> >> >> A character can be made of several codepoints.
> >> >> >>
> >> >> >> "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
> >> >> >> "é"
> >> >> >>
> >> >> >> So Rakudo has to read the next codepoint to make sure that it isn't
> >> >> >> a combining codepoint.
> >> >> >>
> >> >> >> It is probably faking up the reads to look right when reading
> >> >> >> ASCII, but failing to do that for wider codepoints.
> >> >> >>
> >> >> >> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner <doom...@gmail.com>
> >> >> >> wrote:
> >> >> >>> I thought that doing a readchars on a filehandle, seeking
> >> >> >>> backwards the width of the char in bytes and then doing another
> >> >> >>> read would always get the same character. That works for
> >> >> >>> ascii-range characters (1-byte in utf-8 encoding) but not
> >> >> >>> multi-byte "wide" characters (commonly 3-bytes in utf-8).
> >> >> >>>
> >> >> >>> The question then, is why do I need a $nudge of 3 for wide chars,
> >> >> >>> but not ascii-range ones?
> >> >> >>>
> >> >> >>> use v6;
> >> >> >>> use Test;
> >> >> >>>
> >> >> >>> my $tmpdir = IO::Spec::Unix.tmpdir;
> >> >> >>> my $file = "$tmpdir/scratch_file.txt";
> >> >> >>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";
> >> >> >>> # ሀⶀ䷼ꪪⲤⲎ
> >> >> >>> my $ascii_str = "ABCDEFGHI";
> >> >> >>>
> >> >> >>> subtest {
> >> >> >>>     my $nudge = 3;
> >> >> >>>     test_read_and_read_again($unichar_str, $file, $nudge);
> >> >> >>> }, "Wide unicode chars: $unichar_str";
> >> >> >>>
> >> >> >>> subtest {
> >> >> >>>     my $nudge = 0;
> >> >> >>>     test_read_and_read_again($ascii_str, $file, $nudge);
> >> >> >>> }, "Ascii-range chars: $ascii_str";
> >> >> >>>
> >> >> >>> # write given string to file, then read the third character twice and check
> >> >> >>> sub test_read_and_read_again($str, $file, $nudge = 0) {
> >> >> >>>     spurt $file, $str;
> >> >> >>>     my $fh = $file.IO.open;
> >> >> >>>     $fh.readchars(2);  # skip a few
> >> >> >>>     my $chr_1 = $fh.readchars(1);
> >> >> >>>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
> >> >> >>>     my $step_back = $width + $nudge;
> >> >> >>>     $fh.seek: -$step_back, SeekFromCurrent;
> >> >> >>>     my $chr_2 = $fh.readchars(1);
> >> >> >>>     is( $chr_1, $chr_2,
> >> >> >>>         "read, seek back, and read again gets same char with nudge of $nudge" );
> >> >> >>> }
> >
> > I don't think the utf-16 issue is related. On the topic of readchars. Can
> > someone tell me why that readchars script result is unexpected? Maybe I
> > missed part of the conversation but if someone can summarize expected vs
> > actual result that would be great.

On my initial look it seems like readchars is broken. If I write one
character to a file:

    spurt "test", "\c[woman]";
    my $fh = open "test", :read;
    $fh.tell;          # 0. All good so far
    $fh.readchars(1);  # this returns \c[woman]
    $fh.tell;          # returns 4

So this works. But if I do:

    spurt "test", "\c[woman]\c[man]";
    my $fh = open "test", :read;
    $fh.readchars(1);  # this returns \c[woman]
    $fh.tell;          # this returns... 8...

So it looks like readchars is reading the whole file (at least for these
very small files). If I write "\c[woman]\c[man]\c[cat]" and do
readchars(1), .tell returns 12.

These are the results for me on:

This is Rakudo version 2020.02.1-254-g87d2ff953 built on MoarVM version
2020.02.1-66-g4f08d803f implementing Raku 6.d.

I am going to compile the development build, and then maybe see if I can
look into the Rakudo source.
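
To put numbers on "expected vs actual", here is a small sketch that only does UTF-8 byte arithmetic on the strings used in this thread. It does not call Rakudo at all, and it is written in Python rather than Raku so that the offsets are computed independently of the implementation under suspicion. The helper name `offsets_after_each_char` is made up for illustration, and the sketch assumes every character in these strings is a single codepoint, which holds for the strings used here.

```python
def offsets_after_each_char(text: str) -> list:
    """Return the UTF-8 byte offset just past each character of text."""
    offsets = []
    position = 0
    for ch in text:  # iterates codepoints; each is a whole character here
        position += len(ch.encode("utf-8"))
        offsets.append(position)
    return offsets


# Strings taken from the test scripts quoted in this thread.
wide = "\u1200\u2D80\u4DFC\uAAAA\u2CA4\u2C8E"
ascii_str = "ABCDEFGHI"
# WOMAN, MAN, CAT (the characters named in the spurt examples)
emoji = "\U0001F469\U0001F468\U0001F408"

for label, text in [("wide", wide), ("ascii", ascii_str), ("emoji", emoji)]:
    print(label, offsets_after_each_char(text))
```

This prints [3, 6, 9, 12, 15, 18] for the wide string, 1 through 9 for the ASCII string, and [4, 8, 12] for the emoji string. So if .tell reported the position right after the last character returned, I would expect 6 after reading two wide characters and 9 after the third, where the quoted script output shows 9 and 12. For the emoji files I would expect 4 after readchars(1), where I see 8 or 12, which is the full size of the file in each case.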