On 19 Dec 2021 at 7:54, Ed Greshko wrote:

From: Ed Greshko <ed.gres...@greshko.com>
Date sent: Sun, 19 Dec 2021 07:54:31 +0800
Subject: Re: Having strange result on processing UTF-8 file
To: users@lists.fedoraproject.org
Send reply to: Community support for Fedora users <users@lists.fedoraproject.org>

> On 19/12/2021 02:15, Michael D. Setzer II via users wrote:
> > Download 64 web pages into a single file using wget2. That is fine.
>
> One more thing.....
>
> The single file you get is an html formatted file, yes? For the results that
> you want, and how you want to use it, do you really want html? If not, why
> don't you convert to plain text?
>
> Can we assume the 64 pages are always the same pages?
>

Yes. I figured out a workaround, but I am not exactly sure what changes the file from UTF-8 to the strange type.

    system("wget2 --max-threads=70 --secure-protocol=PFS -q --base=\"https://www.uog.edu/directory/\" -i testlistuog");

testlistuog has the lines

    ?page=01
    ?page=02
    ---
    ?page=64

but that could change if they add or remove records; there are currently 633. Some lines in the file are over 25000 characters. The total download is about 13M. The lines I actually need for the data come to just 256K, so the file has lots of junk (stuff I don't need for what I'm doing).

Originally the program found where the UTF-8 characters were on each line and printed the hex of the 2- or 3-byte sequences. It then printed from that point in the line using %10.10s, since I didn't need to see the whole line. That seems to cause the problem, but I'm not sure why. I modified the program to print only the 2- or 3-byte UTF-8 character, and the file stays the same type as the original file. Then I tried plain %s, and it also stays a UTF-8 file. But as I mentioned, some lines are over 25000 characters, and some lines have multiple UTF-8 characters, so perhaps the %10.10s was cutting in the middle of a UTF-8 sequence?

Contents of the main function. Not pretty, but works.

    FILE *fp1, *fp2;
    char line[32000], fileout[4096];
    size_t i, k, n, len;
    int j = 0;                                  /* input line number */

    if (argc < 2) { fprintf(stderr, "Need file name\n"); exit(1); }
    fp1 = fopen(argv[1], "r");
    if (fp1 == NULL) { perror(argv[1]); exit(1); }
    /* snprintf cannot overflow fileout, unlike strcpy/strcat into 20 bytes */
    snprintf(fileout, sizeof fileout, "%s.out", argv[1]);
    fp2 = fopen(fileout, "wb");
    if (fp2 == NULL) { perror(fileout); exit(1); }

    /* fgets() returns NULL at end of file, so no feof() test is needed */
    while (fgets(line, sizeof line, fp1) != NULL) {
        line[strcspn(line, "\n")] = 0;          /* strip the newline, if any */
        j++;
        len = strlen(line);
        for (i = 0; i < len; i++) {
            unsigned char c1 = (unsigned char)line[i];
            if (c1 < 0x80) continue;            /* plain ASCII */
            /* sequence length from the lead byte:
               110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4, anything else = 1 */
            n = (c1 & 0xE0) == 0xC0 ? 2 :
                (c1 & 0xF0) == 0xE0 ? 3 :
                (c1 & 0xF8) == 0xF0 ? 4 : 1;
            if (n > len - i) n = len - i;       /* sequence cut off at end of line */
            fprintf(fp2, "%5d %5ld ", j, (long)i);
            for (k = 0; k < n; k++)
                fprintf(fp2, "%2.2x", (unsigned char)line[i + k]);
            fprintf(fp2, " [%s]\n", &line[i]);
            i += n - 1;                         /* skip the rest of this character */
        }
    }
    fclose(fp1);
    fclose(fp2);
    return 0;

Thanks again. I will try to figure out what makes the file stop being UTF-8. As I said, the pages have lots of weird lines. But I get the data I need and build a MariaDB table with the 633 records that can be sorted via PHP. Only 3 of the lines I use contain UTF-8 characters, while the main file has 2000 lines with UTF-8 codes. I guess at least one of those lines caused the issue.
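
On the %10.10s question: in printf the precision of %s counts bytes, not characters, so %10.10s writes at most 10 bytes from the starting point. If the tenth byte falls inside a 2- or 3-byte sequence, the output ends with a partial sequence, which is no longer valid UTF-8. That would plausibly explain why plain %s keeps the file as UTF-8 and %10.10s does not. Below is a minimal sketch of one way around it, assuming the input itself is valid UTF-8. The names utf8_prefix_len and print_prefix are made up for this illustration and are not part of the program above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* True for a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: length in bytes of the longest prefix of s that is
 * at most max_bytes long and does not end inside a multi-byte character.
 * Assumes s is valid UTF-8.
 */
size_t utf8_prefix_len(const char *s, size_t max_bytes)
{
    assert(s != NULL);

    size_t len = strlen(s);
    if (len <= max_bytes)
        return len;                 /* whole string fits, nothing to cut */

    size_t n = max_bytes;
    /* s[n] is the first byte that would be dropped. If it is a
       continuation byte, the cut splits a character, so back up
       until the cut sits just before a lead byte or an ASCII byte. */
    while (n > 0 && is_continuation((unsigned char)s[n]))
        n--;
    return n;
}

/* Hypothetical helper: print at most max_bytes of s inside brackets,
   never ending in the middle of a character. */
void print_prefix(FILE *out, const char *s, size_t max_bytes)
{
    fprintf(out, "[%.*s]\n", (int)utf8_prefix_len(s, max_bytes), s);
}
```

In the loop above, print_prefix(fp2, &line[i], 10) would take the place of the %10.10s output. The prefix can come out a byte or two shorter than 10, but every character in it is complete.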

Program output for the 3 lines I use:

      131    27 c3b1 [ña, Ph.D.;Crisostomo-Muña;Doreen;Professor of Accounting;School of Business & Public Administration;735-2501/20;doree...@triton.uog.edu]
      131    51 c3b1 [ña;Doreen;Professor of Accounting;School of Business & Public Administration;735-2501/20;doree...@triton.uog.edu]
      276    14 c3a5 [åni" Isidro;Isidro;Jaevani;Junior Web Developer;Office of Information Technology;735-2631;jisi...@triton.uog.edu]
      344    18 c381 [Álvarez-Piñer, Ph.D.;Madrid Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]
      344    29 c3b1 [ñer, Ph.D.;Madrid Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]
      344    48 c381 [Álvarez-Piñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]
      344    59 c3b1 [ñer;Carlos;Director / Associate Professor of Spanish Pacific History;Micronesian Area Research Center;735-2156;madr...@triton.uog.edu]

I tried a number of things with iconv, but still ended up with the problem format.

Again, thanks for the time.

> --
> Did 황준호 die?
> _______________________________________________
> users mailing list -- users@lists.fedoraproject.org
> To unsubscribe send an email to users-le...@lists.fedoraproject.org
> Fedora Code of Conduct:
> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives:
> https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
> Do not reply to spam on the list, report it:
> https://pagure.io/fedora-infrastructure