regarding Latin1 to UTF8 encoding

2013-12-08 Thread Hugo Florentino

Hi,

I am having some problems trying to apply regular expressions to a 
webpage encoded in Latin1. I have tried, unsuccessfully, to convert it 
to UTF-8 before applying the regular expression.


Initially I tried to do something like this:

auto input = readText("myfile.htm");
auto output = replace(input, re1, re2);

But I got this error when trying to run the application:
std.utf.UTFException@C:\DMD2\Windows\bin\..\..\src\phobos\std\utf.d(1113): 
Invalid UTF-8 sequence (at index 1)


I then tried this, but the error remains

auto input = readText("myfile.htm");
string buffer;
transcode(input, buffer);
auto output = replace(buffer, re1, re2);

Also, this did not work:

auto input = cast(string) read("myfile.htm");
string buffer;
transcode(input, buffer);
auto output = replace(buffer, re1, re2);

core.exception.AssertError@std.encoding(1995): Assertion failure

Please, any help would be appreciated.

Regards, Hugo


Re: regarding Latin1 to UTF8 encoding

2013-12-08 Thread Adam D. Ruppe

On Monday, 9 December 2013 at 02:40:29 UTC, Hugo Florentino wrote:

> auto input = readText("myfile.htm");


Don't use readText if the file isn't UTF-8; readText assumes it is 
UTF-8.


I've never actually used std.encoding (I wrote my own encoding 
module for my dom.d, which I used for website scraping too) but I 
think this is what you want:


Latin1String input = cast(Latin1String) std.file.read("myfile.htm");

string buffer;
transcode(input, buffer);
auto output = replace(buffer, re1, re2);


see if that works


Re: regarding Latin1 to UTF8 encoding

2013-12-08 Thread Hugo Florentino


Actually, it did work, even when declaring input as auto.
It seems the explicit cast to Latin1String was the required 
element for it to work, which makes sense now that I think about it.
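
The reason the cast matters is that transcode picks the source encoding from the static type of its first argument: a string is treated as text that is already UTF-8, while a Latin1String is treated as one byte per character. A minimal sketch of that, where the helper name latin1ToUtf8 and the sample bytes are assumptions made up for illustration:

```d
import std.encoding : Latin1String, transcode;
import std.stdio : writeln;

/// Hypothetical helper: treats every input byte as one Latin1 character.
string latin1ToUtf8(immutable(ubyte)[] raw)
{
    // The cast only changes how the bytes are typed; nothing is copied.
    auto latin1 = cast(Latin1String) raw;

    string utf8;
    transcode(latin1, utf8);
    return utf8;
}

void main()
{
    // "café" in Latin1: the accented letter is the single byte 0xE9.
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];
    auto text = latin1ToUtf8(raw);

    // In UTF-8 the accented letter takes two bytes (0xC3 0xA9).
    assert(text == "caf\u00E9");
    assert(text.length == 5);
    writeln(text);
}
```

With cast(string) instead, the same bytes would be handed over as if they were UTF-8, and a lone 0xE9 is not a valid UTF-8 sequence, which is consistent with the failures reported above.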


Thanks a lot for the (amazingly quick) reply ;)

Now, if I may add a closely related question:

Suppose myfile.txt was given to me daily by careless people who 
usually save it as Latin1 but from time to time might save it as UTF8.
Is there a way to detect the encoding prior to typecasting/loading the 
file?


Regards, Hugo


Re: regarding Latin1 to UTF8 encoding

2013-12-08 Thread Adam D. Ruppe

On Monday, 9 December 2013 at 03:07:58 UTC, Hugo Florentino wrote:
> Is there a way to detect the encoding prior to 
> typecasting/loading the file?


UTF-8 can be detected fairly reliably, but not much luck for 
other encodings. A Windows-1258 and a Latin1 file, for example, 
are usually fairly indistinguishable from a binary perspective - 
they use the same numbers, just for different things.


(It is possible to distinguish them if you use some context and 
grammar check kind of things, but that's not easy.)



But UTF-8 has a neat feature: any non-ASCII bytes must form valid 
multi-byte sequences, and it is unlikely that random data would 
validate correctly.


std.utf.validate can do that (though it throws an exception if it 
fails, ugh!)


So here's how I did it in my own characterencodings.d:

https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff/blob/master/characterencodings.d#L138


import std.encoding : Latin1String, transcode;
import std.utf : UTFException, validate;

// rawData holds the file's bytes, e.g. from std.file.read
string utf8string;
try {
    validate!string(cast(string) rawData);
    // validation passed, assume it is UTF-8 and use it
    utf8string = cast(string) rawData;
} catch (UTFException t) {
    // not UTF-8, try Latin1
    transcode(cast(Latin1String) rawData, utf8string);
}

// now go ahead and use utf8string, it should be set
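
For reuse, the same check can be packaged as a function. This is only a sketch under assumptions: the name toUtf8 and the byte samples are invented here, and anything that fails UTF-8 validation is simply assumed to be Latin1.

```d
import std.encoding : Latin1String, transcode;
import std.utf : UTFException, validate;

/// Hypothetical helper: returns the bytes as UTF-8 text if they validate,
/// otherwise transcodes them on the assumption that they are Latin1.
string toUtf8(immutable(ubyte)[] rawData)
{
    try
    {
        auto candidate = cast(string) rawData;
        validate(candidate); // throws UTFException on malformed UTF-8
        return candidate;
    }
    catch (UTFException)
    {
        string converted;
        transcode(cast(Latin1String) rawData, converted);
        return converted;
    }
}

void main()
{
    immutable(ubyte)[] asUtf8   = [0x63, 0x61, 0x66, 0xC3, 0xA9]; // "café" in UTF-8
    immutable(ubyte)[] asLatin1 = [0x63, 0x61, 0x66, 0xE9];       // "café" in Latin1

    // Both inputs end up as the same UTF-8 string.
    assert(toUtf8(asUtf8) == "caf\u00E9");
    assert(toUtf8(asLatin1) == "caf\u00E9");
}
```

For a file, it could be called as toUtf8(cast(immutable(ubyte)[]) std.file.read("myfile.txt")). A Latin1 file that happens to contain only ASCII will pass validation and be treated as UTF-8, which is harmless because the two encodings agree on that range.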


Re: regarding Latin1 to UTF8 encoding

2013-12-08 Thread Hugo Florentino


Clever solution, thanks.
Could this work using scope instead of try/catch?

P.S. Nice unit, by the way.


Re: regarding Latin1 to UTF8 encoding

2013-12-08 Thread Adam D. Ruppe

On Monday, 9 December 2013 at 03:33:46 UTC, Hugo Florentino wrote:

> Could this work using scope instead of try/catch?


Maybe, but I don't think it would be very pretty. Really, I think 
validate should return a bool instead of throwing, but since it 
doesn't, the try/catch is as close as it gets.
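
To get that bool-returning behaviour today, the try/catch can be hidden in a small wrapper. The name isValidUtf8 below is hypothetical, not something from Phobos or from characterencodings.d. As far as I know, scope(failure) is intended for cleanup and lets the exception keep propagating, so on its own it would not provide the Latin1 fallback.

```d
import std.utf : UTFException, validate;

/// Hypothetical wrapper: reports UTF-8 validity as a bool instead of throwing.
bool isValidUtf8(const(char)[] text)
{
    try
    {
        validate(text);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    assert(isValidUtf8("plain ascii"));
    assert(isValidUtf8("caf\u00E9"));

    // A lone 0xE9 byte (Latin1 'é') is not a complete UTF-8 sequence.
    immutable(ubyte)[] latin1Bytes = [0x63, 0x61, 0x66, 0xE9];
    assert(!isValidUtf8(cast(const(char)[]) latin1Bytes));
}
```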



> P.S. Nice unit, by the way.


BTW if you need to parse random html, grab that file and my dom.d 
from the same repo.


auto document = new Document();
document.parseGarbage(whatever_data);

parseGarbage tries to determine the character encoding 
automatically, from the validate check or the meta tags in the 
HTML if they are there, then guessing if not. It is pretty good 
at parsing broken html tag soup to make a dom similar to the 
browser.


Then you can get data out of it doing things like

auto firstParagraph = document.querySelector("p:first-child");
if(firstParagraph is null) writeln("no first child paragraph");
else writeln("first child paragraph text: ", 
firstParagraph.innerText);


and stuff like that. If you have used JavaScript before, dom.d 
should look fairly familiar.