On 19 Dec 2021 at 7:54, Ed Greshko wrote:

From:                   Ed Greshko <ed.gres...@greshko.com>
Date sent:              Sun, 19 Dec 2021 07:54:31 +0800
Subject:                Re: Having strange result on processing UTF-8 file
To:                     users@lists.fedoraproject.org
Send reply to:          Community support for Fedora users <users@lists.fedoraproject.org>

> On 19/12/2021 02:15, Michael D. Setzer II via users wrote:
> > Download 64 web pages into a single file using wget2. That is fine.
>
> One more thing.....
>
> The single file you get is an html formatted file, yes?  For the results that
> you want, and how you want to use it, do you really want html?  If not, why
> don't you convert to plain text?
>
> Can we assume the 64 pages are always the same pages?
>
Yes. I figured out a workaround, but I'm not exactly sure what
the issue is that changes the file from UTF-8 to a strange type.
system("wget2 --max-threads=70 --secure-protocol=PFS -q "
       "--base=\"https://www.uog.edu/directory/\" "
       "-i testlistuog");
testlist.uog has lines
?page=01
?page=02
---
?page=64

But that could change if they add or remove some; there are
currently 633 records. Some lines in the file are over
25000 characters. The total download is about 13M.
The actual lines I need for the data are just 256K, so it
has lots of junk (stuff I don't need for what I'm doing).
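
Since the page count can change, the list file could be regenerated instead of edited by hand. A minimal sketch, assuming the pages are always numbered from 01 with two digits; write_page_list is a hypothetical helper and not part of my program:

```c
#include <stdio.h>

/* Write "?page=01" ... "?page=NN", one per line, to out.
   Assumption: page numbers are zero-padded to two digits.
   Returns 0 on success, -1 on a write error. */
int write_page_list(FILE *out, int pages)
{
    for (int p = 1; p <= pages; p++)
        if (fprintf(out, "?page=%02d\n", p) < 0)
            return -1;
    return 0;
}
```

Calling it with a file opened for writing and pages set to 64 would produce the same 64 lines as the current list.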

Originally I had it find where the UTF-8 characters were
on each line and print out the hex for the 2- or 3-byte
sequences. Then it would print from that point in the line using
%10.10s, since I didn't need to see the whole line. But that
causes the problem, and I'm not sure why.

I modified the program to print just the 2- or 3-byte UTF-8
character, and the output stays the same type as the original file. Then I
tried just using %s, and it also stays a UTF-8 file. But as
I mentioned, some lines are over 25000 characters, and some
lines have multiple UTF-8 characters, so perhaps the
%10.10s was cutting off in the middle of a UTF-8 sequence?
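
That guess is plausible, because the precision in %10.10s counts bytes, not characters, so it can stop between the lead byte and the continuation byte of a sequence. A rough sketch of one way to truncate on a character boundary instead, assuming the input is valid UTF-8; utf8_prefix_len and print_utf8_prefix are hypothetical names, not something from my program:

```c
#include <stdio.h>
#include <string.h>

/* Return the largest length <= max_bytes that does not cut a UTF-8
   sequence in s. Assumes s is valid UTF-8. */
size_t utf8_prefix_len(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    if (len <= max_bytes)
        return len;
    len = max_bytes;
    /* s[len] is the first byte that would be dropped. If it is a
       continuation byte (10xxxxxx), back up to its lead byte so the
       whole sequence is dropped together. */
    while (len > 0 && ((unsigned char)s[len] & 0xC0) == 0x80)
        len--;
    return len;
}

/* Print at most max_bytes of s in brackets, whole characters only. */
void print_utf8_prefix(FILE *out, const char *s, size_t max_bytes)
{
    fprintf(out, "[%.*s]\n", (int)utf8_prefix_len(s, max_bytes), s);
}
```

The output may be shorter than max_bytes by up to 3 bytes, but it never ends with half a character.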

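The complete program. Not pretty, but it works.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
        FILE *fp1, *fp2;
        char line[32000], fileout[4096];
        unsigned char c1, c2, c3;
        size_t i, len;
        int j = 0;

        if (argc < 2)
        {
                fprintf(stderr, "Need file name\n");
                exit(1);
        }
        /* snprintf cannot overflow fileout, unlike strcpy/strcat */
        snprintf(fileout, sizeof fileout, "%s.out", argv[1]);
        fp1 = fopen(argv[1], "r");
        fp2 = fopen(fileout, "wb");
        if (fp1 == NULL || fp2 == NULL)
        {
                perror("fopen");
                exit(1);
        }
        /* loop on fgets itself; feof() only turns true after a failed read */
        while (fgets(line, sizeof line, fp1) != NULL)
        {
                j++;
                line[strcspn(line, "\n")] = 0;  /* drop the newline, if any */
                len = strlen(line);
                if (len < 3) continue;
                for (i = 0; i < len - 2; i++)
                {
                        c1 = (unsigned char)line[i];
                        if (c1 < 0x80) continue;  /* plain ASCII byte */
                        c2 = (unsigned char)line[i+1];
                        c3 = (unsigned char)line[i+2];
                        /* lead bytes 0xC2, 0xC3, 0xC4 and 0xC8 are the only
                           2-byte sequences handled; anything else is
                           treated as 3 bytes */
                        if (c1 == 0xC2 || c1 == 0xC3 || c1 == 0xC4 || c1 == 0xC8)
                        {
                                fprintf(fp2, "%5d %5ld %2.2x%2.2x     [%s]\n",
                                        j, (long)i, c1, c2, &line[i]);
                                i++;     /* skip the continuation byte */
                        }
                        else
                        {
                                fprintf(fp2, "%5d %5ld %2.2x%2.2x%2.2x   [%s]\n",
                                        j, (long)i, c1, c2, c3, &line[i]);
                                i += 2;  /* skip both continuation bytes */
                        }
                }
        }
        fclose(fp1); fclose(fp2);
        return 0;
}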


Thanks again. I will try to figure out what causes it to stop
being UTF-8. Like I said, the pages have lots of weird
lines. But I get the data I need, and build a MariaDB table with
the 633 records that can be sorted via PHP.
There are actually only 3 lines I use that have UTF-8
characters, while the main file has 2000 lines with UTF-8
codes. I guess at least one of those lines caused the issue?

  131    27 c3b1     [ña, Ph.D.;Crisostomo-Muña;Doreen;Professor of Accounting;School of Business & Public Administration;735-2501/20;doree...@triton.uog.edu]
  131    51 c3b1     [ña;Doreen;Professor of Accounting;School of Business & Public Administration;735-2501/20;doree...@triton.uog.edu]
  276    14 c3a5     [åni" Isidro;Isidro;Jaevani;Junior Web Developer;Office of Information Technology;735-2631;jisi...@triton.uog.edu]
  344    18 c381     [Álvarez-Piñer, Ph.D.;Madrid Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]
  344    29 c3b1     [ñer, Ph.D.;Madrid Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]
  344    48 c381     [Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]
  344    59 c3b1     [ñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]

I tried a number of things with iconv, but still ended up with
the problem format.
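
One way to locate the line that breaks the encoding would be to scan the output for the first byte that is not part of a well-formed sequence. This is only a simplified sketch: utf8_first_invalid is a hypothetical helper, and it checks the lead/continuation structure but does not reject surrogates or every overlong form.

```c
#include <stddef.h>

/* Return the offset of the first byte that is not part of a
   well-formed UTF-8 sequence, or n if the whole buffer passes.
   Simplified check: rejects stray continuation bytes, the invalid
   leads 0xC0, 0xC1 and 0xF5-0xFF, and truncated sequences. */
size_t utf8_first_invalid(const unsigned char *buf, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = buf[i];
        size_t need;                       /* continuation bytes expected */
        if (c < 0x80) need = 0;
        else if (c >= 0xC2 && c <= 0xDF) need = 1;
        else if (c >= 0xE0 && c <= 0xEF) need = 2;
        else if (c >= 0xF0 && c <= 0xF4) need = 3;
        else return i;                     /* stray continuation or bad lead */
        if (need > n - i - 1)
            return i;                      /* sequence cut off at buffer end */
        for (size_t k = 1; k <= need; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return i;                  /* missing continuation byte */
        i += need + 1;
    }
    return n;
}
```

Running this on each output line and printing the line number whenever the result is less than the line length should point at the place where a sequence was cut.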

Again, thanks for the time.

> --
> Did 황준호 die?
> _______________________________________________
> users mailing list -- users@lists.fedoraproject.org
> To unsubscribe send an email to users-le...@lists.fedoraproject.org
> Fedora Code of Conduct: 
> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: 
> https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
> Do not reply to spam on the list, report it: 
> https://pagure.io/fedora-infrastructure

