Re: Searching for a string in a text buffer with a regular expression

2013-12-09 Thread maxpat78

I mean a code fragment like this:

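import core.stdc.stdlib : exit;
import std.encoding : Latin1String, transcode;
import std.file : read;
import std.regex : ctRegex, match;
import std.stdio : writeln;
import std.string : format, indexOf;

foreach(i; 1..2085)
{
    // Bugbug: when we read in the buffer, we can't know anything about its encoding...
    // But REGEX could fail if it contained unknown chars!
    Latin1String buf;
    string s;

    try
    {
        // Reinterpret the raw bytes as Latin-1, then transcode them to UTF-8
        buf = cast(Latin1String) read(format("psi\\psi%04d.htm", i));
        transcode(buf, s);
    }
    catch (Exception e)
    {
        writeln("Last record (", i, ") reached.");
        exit(1);
    }

    // Exception Invalid UTF-8 sequence @index 1 in file 55
    enum rx = ctRegex!(`<p class="aggiornamentoAlbo">.+?</div>`, "gs");

    auto m = match(s, rx);

    if (! m.empty())
    {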
        if (indexOf(m.captures[0], , 0) > -1 &&
            indexOf(m.captures[0], "1983", 0) > -1)
            writeln(m.captures[0]);
    }
}

The question is: what kind of cast should I use to safely 
(= without conversion exceptions being raised) scan every possible 
kind of textual (or binary) buffer, like in Python 2.7.x?


Thanks!
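
One way to get the Python 2.7 behaviour asked about here is to decode every byte as the code point with the same value (the ISO-8859-1 rule). That mapping is defined for all 256 byte values, so it cannot fail, and ASCII patterns still match as expected. A minimal sketch, assuming that approach is acceptable; `bytesToText` and `rawContains` are hypothetical helper names, not Phobos functions:

```d
import std.array : appender;
import std.regex : matchFirst, regex;

/// Hypothetical helper: maps each byte to the code point with the same
/// value (the ISO-8859-1 rule), so any buffer becomes valid UTF-8.
string bytesToText(const(ubyte)[] raw)
{
    auto result = appender!string();
    foreach (b; raw)
        result.put(cast(dchar) b); // the appender encodes the code point as UTF-8
    return result.data;
}

/// Hypothetical helper: true if `pattern` matches anywhere in the raw bytes.
bool rawContains(const(ubyte)[] raw, string pattern)
{
    return !matchFirst(bytesToText(raw), regex(pattern)).empty;
}
```

A call such as `rawContains(cast(const(ubyte)[]) read(path), "1983")` would then scan a file without any decoding exception. The trade-off is that bytes above 0x7F only mean what Latin-1 says they mean, so non-ASCII patterns are reliable only for input that really is Latin-1.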


Searching for a string in a text buffer with a regular expression

2013-12-06 Thread maxpat78
While porting a simple Python script to D, I found the following 
problem.


I need to read in a few thousand small text files and search 
each one for a match with a given regular expression.


Obviously, the program can't (and it should not) be certain about 
the encoding of each input file.


I initially used read() and cast the result with cast(char[]), but 
at some point the regex engine failed with an exception: it 
encountered a byte sequence it couldn't decode as UTF-8. 
That is fair, since char[] is expected to hold UTF-8 and is not a 
plain byte[].


Now I'm casting to Latin1String, since I know this is the 
right encoding for the input buffers, and it works fine at 
last... but what if I needed to handle a RAW (binary? 
unknown encoding?) buffer?


Is there a simple and elegant solution in D for such a case?
Python didn't give me such problems!
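
The failure described above can be avoided by checking the bytes before handing them to the regex engine. A minimal sketch, assuming the files are either valid UTF-8 or Latin-1: it validates with `std.utf.validate` and falls back to a Latin-1 transcode when validation fails. `decodeLenient` is a hypothetical name.

```d
import std.encoding : Latin1String, transcode;
import std.utf : UTFException, validate;

/// Hypothetical helper: returns the buffer as a D string, treating it as
/// UTF-8 when it validates and as Latin-1 otherwise (an assumption about
/// the input, not something the bytes themselves can prove).
string decodeLenient(const(ubyte)[] raw)
{
    auto asUtf8 = cast(const(char)[]) raw;
    try
    {
        validate(asUtf8);      // throws UTFException on malformed UTF-8
        return asUtf8.idup;
    }
    catch (UTFException)
    {
        string result;
        transcode(cast(Latin1String) raw.idup, result);
        return result;
    }
}
```

The returned string is always valid UTF-8, so it can be passed to `match` or `matchFirst` safely. Whether the Latin-1 fallback is the right guess depends on where the files come from.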


Re: Searching for a string in a text buffer with a regular expression

2013-12-06 Thread bearophile

maxpat78:


Is there a simple and elegant solution in D for such a case?
Python didn't give me such problems!


Do you mean Python3?

Bye,
bearophile


Re: Searching for a string in a text buffer with a regular expression

2013-12-06 Thread Shammah Chancellor

On 2013-12-06 08:53:04 +, maxpat78 said:


While porting a simple Python script to D, I found the following problem.

I need to read in a few thousand small text files and search each 
one for a match with a given regular expression.


Obviously, the program can't (and it should not) be certain about the 
encoding of each input file.


I initially used read() and cast the result with cast(char[]), but at 
some point the regex engine failed with an exception: it encountered a 
byte sequence it couldn't decode as UTF-8. That is fair, since char[] 
is expected to hold UTF-8 and is not a plain byte[].


Now I'm casting to Latin1String, since I know this is the right 
encoding for the input buffers, and it works fine at last... but what 
if I needed to handle a RAW (binary? unknown encoding?) buffer?


Is there a simple and elegant solution in D for such a case?
Python didn't give me such problems!


Why don't you follow one of the file reading examples?

readText is what you're looking for.

http://dlang.org/phobos/std_file.html#.readText
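
For reference, a minimal sketch of the `readText` route suggested here. `readText` validates the file contents as UTF-8 and throws `std.utf.UTFException` when they are malformed, so a file in another encoding (such as Latin-1 with accented characters) still needs separate handling. `firstMatchInFile` is a hypothetical helper name.

```d
import std.file : readText;
import std.regex : matchFirst, regex;
import std.utf : UTFException;

/// Hypothetical helper: returns the first match of `pattern` in the file,
/// or null if there is no match or the file is not valid UTF-8.
string firstMatchInFile(string path, string pattern)
{
    string text;
    try
    {
        text = readText(path); // validates the contents as UTF-8
    }
    catch (UTFException)
    {
        return null;           // not UTF-8: the caller must decode it another way
    }

    auto m = matchFirst(text, regex(pattern, "s"));
    return m.empty ? null : m.hit;
}
```

File-system errors (a missing file, for example) surface as `FileException` and are deliberately not caught in this sketch.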