Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-28 Thread Nestor via Digitalmars-d-learn

On Friday, 27 January 2017 at 04:26:31 UTC, Era Scarecrow wrote:
> Skipping the BOM is just a matter of skipping the first two
> bytes identifying it...


AFAIK in some cases the BOM takes up to 4 bytes (for UTF-32), so 
when the input encoding is unknown one must perform some kind of 
detection in order to apply the correct transcoding later. I 
thought by now dmd had this functionality built in and exposed, 
since the compiler itself seems to do it for source code units.
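
The detection step described here can be sketched by hand. This is only an illustration: the names Encoding, BomInfo and detectBom are made up for this example and are not Phobos APIs. (As far as I know, std.encoding also offers a getBOM function that may already cover this; check the documentation for your compiler version.)

```d
import std.stdio : writeln;

// Hypothetical types for this sketch, not part of Phobos.
enum Encoding { unknown, utf8, utf16le, utf16be, utf32le, utf32be }

struct BomInfo
{
    Encoding encoding; // encoding implied by the BOM, or unknown
    size_t length;     // number of bytes the BOM occupies (0 if none)
}

/// Inspects the first bytes of a file and reports which BOM, if any, they form.
BomInfo detectBom(const(ubyte)[] h)
{
    // The UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE),
    // so the 4-byte patterns have to be tested first.
    if (h.length >= 4 && h[0] == 0xFF && h[1] == 0xFE && h[2] == 0x00 && h[3] == 0x00)
        return BomInfo(Encoding.utf32le, 4);
    if (h.length >= 4 && h[0] == 0x00 && h[1] == 0x00 && h[2] == 0xFE && h[3] == 0xFF)
        return BomInfo(Encoding.utf32be, 4);
    if (h.length >= 3 && h[0] == 0xEF && h[1] == 0xBB && h[2] == 0xBF)
        return BomInfo(Encoding.utf8, 3);
    if (h.length >= 2 && h[0] == 0xFF && h[1] == 0xFE)
        return BomInfo(Encoding.utf16le, 2);
    if (h.length >= 2 && h[0] == 0xFE && h[1] == 0xFF)
        return BomInfo(Encoding.utf16be, 2);
    return BomInfo(Encoding.unknown, 0);
}

void main()
{
    // FF FE followed by 'A' encoded as 41 00
    writeln(detectBom([0xFF, 0xFE, 0x41, 0x00]));
}
```

One caveat of this approach: a UTF-16LE file whose first character after the BOM is U+0000 is indistinguishable from UTF-32LE, and a file without any BOM still needs a heuristic or an explicit setting.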


Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-27 Thread Era Scarecrow via Digitalmars-d-learn

On Friday, 27 January 2017 at 07:02:52 UTC, Jack Applegame wrote:
> On Monday, 16 January 2017 at 14:47:23 UTC, Era Scarecrow wrote:
> > static char[1024*4] buffer;  //4k reusable buffer, NOT thread safe
>
> Maybe I'm wrong, but I think it's thread safe. Because static
> mutable non-shared variables are stored in TLS.


 Perhaps, but fibers or other instances of sharing the buffer 
wouldn't be safe/reliable, at least not for long.
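
Both points can be checked with a small experiment. The sketch below assumes a hypothetical helper named scratch that mirrors the static-buffer pattern from the post; it compares buffer addresses across threads and then shows that two calls on the same thread hand out the same memory.

```d
import core.thread : Thread;
import std.stdio : writeln;

// Hypothetical helper mirroring the pattern under discussion. In D, a
// `static` variable without `shared` or `__gshared` is thread-local.
char[] scratch()
{
    static char[16] buffer;
    return buffer[];
}

// True if a second thread gets a different buffer than the calling thread.
bool bufferIsPerThread()
{
    char* mine = scratch().ptr;
    char* theirs;
    auto t = new Thread({ theirs = scratch().ptr; });
    t.start();
    t.join();
    return mine !is theirs;
}

// True if two calls on the same thread return the same memory, which is
// what makes the result unreliable once anything else calls the function.
bool bufferIsReusedOnSameThread()
{
    char[] first = scratch();
    first[0] = 'A';
    char[] second = scratch();
    second[0] = 'B'; // also overwrites first[0]
    return first[0] == 'B';
}

void main()
{
    writeln("separate buffer per thread:  ", bufferIsPerThread());
    writeln("same buffer within a thread: ", bufferIsReusedOnSameThread());
}
```

So the buffer is safe against other threads, but any caller that keeps a returned slice across a second call (a fiber on the same thread, for instance) sees its data overwritten.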


Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-26 Thread Jack Applegame via Digitalmars-d-learn

On Monday, 16 January 2017 at 14:47:23 UTC, Era Scarecrow wrote:
> static char[1024*4] buffer;  //4k reusable buffer, NOT thread safe


Maybe I'm wrong, but I think it's thread safe. Because static 
mutable non-shared variables are stored in TLS.




Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-26 Thread Era Scarecrow via Digitalmars-d-learn

On Tuesday, 17 January 2017 at 11:40:15 UTC, Nestor wrote:
> Thanks, but unfortunately this function does not produce proper
> UTF8 strings, as a matter of fact the output even starts with
> the BOM. Also it doesn't handle CRLF, and even for LF
> terminated lines it doesn't seem to work for lines other than
> the first.


 I thought you wanted to get the contents line by line, which 
would then remain as UTF-16. Translating between the two types 
shouldn't be hard: probably to!string, or a foreach over the code 
units that appends to a char array, would convert it to UTF-8.
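
A minimal sketch of that conversion, assuming the line has already been read as raw bytes without its BOM. The function name utf16leToUtf8 is made up for this example; the transcoding itself is done by std.conv.to, which converts between D's string types.

```d
import std.conv : to;
import std.exception : enforce;
import std.stdio : writeln;

/// Converts raw UTF-16LE bytes (no BOM) to a UTF-8 string.
string utf16leToUtf8(const(ubyte)[] bytes)
{
    enforce(bytes.length % 2 == 0, "UTF-16 input must have an even byte count");

    // Assemble 16-bit code units explicitly, so the result does not depend
    // on the byte order of the machine running the program.
    auto units = new wchar[](bytes.length / 2);
    foreach (i, ref u; units)
        u = cast(wchar)(bytes[2 * i] | (bytes[2 * i + 1] << 8));

    // to!string transcodes UTF-16 to UTF-8; it throws on invalid input
    // such as an unpaired surrogate.
    return units.to!string;
}

void main()
{
    // "hé" in UTF-16LE is 68 00 E9 00
    writeln(utf16leToUtf8([0x68, 0x00, 0xE9, 0x00]));
}
```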


 Skipping the BOM is just a matter of skipping the first two 
bytes identifying it...


> I guess I have to code encoding detection, buffered read, and
> transcoding by hand, the only problem is that the result could
> be sub-optimal, which is why I was looking for a built-in
> solution.


 Maybe. Honestly I'm not nearly as familiar with the library or 
functions as I would love to be, so often home-made solutions 
seem more prevalent until I learn the lingo. A disadvantage of 
being self taught.


Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-17 Thread Nestor via Digitalmars-d-learn

On Monday, 16 January 2017 at 14:47:23 UTC, Era Scarecrow wrote:
> On Sunday, 15 January 2017 at 19:48:04 UTC, Nestor wrote:
> > I see. So correcting my original doubt:
> >
> > How could I parse an UTF16LE file line by line (producing a
> > proper string in each iteration) without loading the entire
> > file into memory?
>
> Could... roll your own? Although if you wanted UTF-8 output
> instead, that would require a second pass or, better yet,
> changing how i is iterated.
>
> char[] getLine16LE(File inp = stdin) {
>     static char[1024*4] buffer;  //4k reusable buffer, NOT thread safe
>
>     int i;
>     while(inp.rawRead(buffer[i .. i+2]) != null) {
>         if (buffer[i] == '\n')
>             break;
>
>         i+=2;
>     }
>
>     return buffer[0 .. i];
> }


Thanks, but unfortunately this function does not produce proper 
UTF-8 strings; as a matter of fact the output even starts with the 
BOM. Also it doesn't handle CRLF, and even for LF-terminated lines 
it doesn't seem to work for lines other than the first.


I guess I have to code encoding detection, buffered read, and 
transcoding by hand, the only problem is that the result could be 
sub-optimal, which is why I was looking for a built-in solution.
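
For what it is worth, a hand-rolled version of those three steps does not have to be long. The following is a sketch under stated assumptions, not a tuned or built-in solution: eachLineUtf16le is a hypothetical name, the input is assumed to be UTF-16LE with an optional BOM, and the file is read in 4 KiB blocks so it is never held in memory as a whole.

```d
import std.array : appender;
import std.conv : to;
import std.stdio : File, writeln;

/// Reads a UTF-16LE file block by block and calls `sink` with each line
/// converted to UTF-8. Handles an optional BOM and both LF and CRLF.
void eachLineUtf16le(string path, scope void delegate(string) sink)
{
    auto f = File(path, "rb");           // binary mode: no CRLF translation
    wchar[2048] buf;                     // 4 KiB read buffer
    auto pending = appender!(wchar[])(); // code units of the current line
    bool atStart = true;

    void flush()
    {
        auto line = pending.data;
        if (line.length && line[$ - 1] == '\r')
            line = line[0 .. $ - 1];     // drop the CR of a CRLF pair
        sink(line.to!string);            // transcode UTF-16 to UTF-8
        pending.clear();
    }

    for (;;)
    {
        wchar[] got = f.rawRead(buf[]);  // whole 16-bit units only
        if (got.length == 0)
            break;

        // rawRead stores bytes as they are; swap them on big-endian hosts.
        version (BigEndian)
            foreach (ref u; got)
                u = cast(wchar)((u >> 8) | (u << 8));

        if (atStart)
        {
            atStart = false;
            if (got[0] == 0xFEFF)        // skip the byte order mark
                got = got[1 .. $];
        }

        foreach (u; got)
        {
            if (u == '\n')
                flush();
            else
                pending.put(u);
        }
    }

    if (pending.data.length)             // last line without a newline
        flush();
}

void main(string[] args)
{
    if (args.length < 2)
        return;
    eachLineUtf16le(args[1], (string line) { writeln(line); });
}
```

Each line is allocated as a fresh string, so callers may keep it. to!string throws if a line contains invalid UTF-16, for example an unpaired surrogate.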


Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-16 Thread Era Scarecrow via Digitalmars-d-learn

On Sunday, 15 January 2017 at 19:48:04 UTC, Nestor wrote:
> I see. So correcting my original doubt:
>
> How could I parse an UTF16LE file line by line (producing a
> proper string in each iteration) without loading the entire
> file into memory?


Could... roll your own? Although if you wanted UTF-8 output 
instead, that would require a second pass or, better yet, changing 
how i is iterated.


char[] getLine16LE(File inp = stdin) {
    static char[1024*4] buffer;  //4k reusable buffer, NOT thread safe

    int i;
    while(inp.rawRead(buffer[i .. i+2]) != null) {
        if (buffer[i] == '\n')
            break;

        i+=2;
    }

    return buffer[0 .. i];
}


Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-15 Thread Nestor via Digitalmars-d-learn

On Sunday, 15 January 2017 at 16:29:23 UTC, Daniel Kozák wrote:
> This is because byLine returns a range, so until you do
> something with it, it does not cause any harm :)


I see. So correcting my original doubt:

How could I parse an UTF16LE file line by line (producing a 
proper string in each iteration) without loading the entire file 
into memory?


Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-15 Thread Daniel Kozák via Digitalmars-d-learn
On Sun, 15 Jan 2017 14:48:12, Nestor via Digitalmars-d-learn wrote:

> On Friday, 6 January 2017 at 11:42:17 UTC, Mike Wey wrote:
> > On 01/06/2017 11:33 AM, pineapple wrote:
> >> On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
> >>>> I'm not sure if this works quite as intended, but I was at
> >>>> least able to produce a UTF-16 decode error rather than a
> >>>> UTF-8 decode error by setting the file orientation before
> >>>> reading it.
> >>>>
> >>>> import std.stdio;
> >>>> import core.stdc.wchar_ : fwide;
> >>>> void main(){
> >>>>     auto file = File("UTF-16LE encoded file.txt");
> >>>>     fwide(file.getFP(), 1);
> >>>>     foreach(line; file.byLine){
> >>>>         writeln(file.readln);
> >>>>     }
> >>>> }
> >>>
> >>> fwide is not implemented in Windows:
> >>> https://msdn.microsoft.com/en-us/library/aa985619.aspx
> >>
> >> That's odd. It was on Windows 7 64-bit that I put together and
> >> tested that example, and calling fwide definitely had an effect
> >> on program behavior.
> >
> > Are you compiling a 32bit binary? Because in that case you
> > would be using the digital mars c runtime which might have an
> > implementation for fwide.
> 
> After some testing I realized that byLine was not the one 
> failing, but any string manipulation done to the obtained line. 
> Compile the following example with and without -debug and run to 
> see what I mean:
> 
> import std.stdio, std.string;
>
> enum
>   EXIT_SUCCESS = 0,
>   EXIT_FAILURE = 1;
>
> int main() {
>   version(Windows) {
>     import core.sys.windows.wincon;
>     SetConsoleOutputCP(65001);
>   }
>   auto f = File("utf16le.txt", "r");
>   foreach (line; f.byLine()) try {
>     string s;
>     debug s = cast(string)strip(line); // this is the one causing problems
>     if (1 > s.length) continue;
>     writeln(s);
>   } catch(Exception e) {
>     writefln("Error. %s\nFile \"%s\", line %s.", e.msg, e.file, e.line);
>     return EXIT_FAILURE;
>   }
>   return EXIT_SUCCESS;
> }

This is because byLine returns a range, so until you do something with it,
it does not cause any harm :)



Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-15 Thread Nestor via Digitalmars-d-learn

On Sunday, 15 January 2017 at 14:48:12 UTC, Nestor wrote:
> After some testing I realized that byLine was not the one
> failing, but any string manipulation done to the obtained line.
> Compile the following example with and without -debug and run
> to see what I mean:
>
> import std.stdio, std.string;
>
> enum
>   EXIT_SUCCESS = 0,
>   EXIT_FAILURE = 1;
>
> int main() {
>   version(Windows) {
>     import core.sys.windows.wincon;
>     SetConsoleOutputCP(65001);
>   }
>   auto f = File("utf16le.txt", "r");
>   foreach (line; f.byLine()) try {
>     string s;
>     debug s = cast(string)strip(line); // this is the one causing problems
>
>     if (1 > s.length) continue;
>     writeln(s);
>   } catch(Exception e) {
>     writefln("Error. %s\nFile \"%s\", line %s.", e.msg, e.file, e.line);
>
>     return EXIT_FAILURE;
>   }
>   return EXIT_SUCCESS;
> }


By the way, when caught, the exception says it's in file 
src/phobos/std/utf.d line 1217, but that file only has 784 lines. 
That's quite odd.


(I am compiling with dmd 2.072.2)
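
One possible explanation for the difference between the two builds, not confirmed anywhere in this thread: byLine hands each line back as raw bytes typed as char[], without checking them, and the exception only appears once a function such as strip tries to decode those bytes as UTF-8. The sketch below uses std.utf.validate to show that UTF-16LE bytes are rejected as UTF-8; the helper name looksLikeUtf8 is made up for this example.

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

/// Returns true if `bytes` form a valid UTF-8 sequence.
bool looksLikeUtf8(const(ubyte)[] bytes)
{
    try
    {
        // validate decodes the whole input and throws on the first bad sequence.
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // UTF-16LE BOM (FF FE) followed by 'a' encoded as 61 00
    writeln(looksLikeUtf8([0xFF, 0xFE, 0x61, 0x00]));
    // Plain ASCII is also valid UTF-8
    writeln(looksLikeUtf8([0x61, 0x62]));
}
```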


Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-15 Thread Nestor via Digitalmars-d-learn

On Friday, 6 January 2017 at 11:42:17 UTC, Mike Wey wrote:
> On 01/06/2017 11:33 AM, pineapple wrote:
> > On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
> >>> I'm not sure if this works quite as intended, but I was at
> >>> least able to produce a UTF-16 decode error rather than a
> >>> UTF-8 decode error by setting the file orientation before
> >>> reading it.
> >>>
> >>> import std.stdio;
> >>> import core.stdc.wchar_ : fwide;
> >>> void main(){
> >>>     auto file = File("UTF-16LE encoded file.txt");
> >>>     fwide(file.getFP(), 1);
> >>>     foreach(line; file.byLine){
> >>>         writeln(file.readln);
> >>>     }
> >>> }
> >>
> >> fwide is not implemented in Windows:
> >> https://msdn.microsoft.com/en-us/library/aa985619.aspx
> >
> > That's odd. It was on Windows 7 64-bit that I put together and
> > tested that example, and calling fwide definitely had an effect
> > on program behavior.
>
> Are you compiling a 32bit binary? Because in that case you
> would be using the digital mars c runtime which might have an
> implementation for fwide.


After some testing I realized that byLine was not the one 
failing, but any string manipulation done to the obtained line. 
Compile the following example with and without -debug and run to 
see what I mean:


import std.stdio, std.string;

enum
  EXIT_SUCCESS = 0,
  EXIT_FAILURE = 1;

int main() {
  version(Windows) {
    import core.sys.windows.wincon;
    SetConsoleOutputCP(65001);
  }
  auto f = File("utf16le.txt", "r");
  foreach (line; f.byLine()) try {
    string s;
    debug s = cast(string)strip(line); // this is the one causing problems

    if (1 > s.length) continue;
    writeln(s);
  } catch(Exception e) {
    writefln("Error. %s\nFile \"%s\", line %s.", e.msg, e.file, e.line);

    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}


Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-06 Thread Mike Wey via Digitalmars-d-learn

On 01/06/2017 11:33 AM, pineapple wrote:
> On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
> >> I'm not sure if this works quite as intended, but I was at least able
> >> to produce a UTF-16 decode error rather than a UTF-8 decode error by
> >> setting the file orientation before reading it.
> >>
> >> import std.stdio;
> >> import core.stdc.wchar_ : fwide;
> >> void main(){
> >>     auto file = File("UTF-16LE encoded file.txt");
> >>     fwide(file.getFP(), 1);
> >>     foreach(line; file.byLine){
> >>         writeln(file.readln);
> >>     }
> >> }
> >
> > fwide is not implemented in Windows:
> > https://msdn.microsoft.com/en-us/library/aa985619.aspx
>
> That's odd. It was on Windows 7 64-bit that I put together and tested
> that example, and calling fwide definitely had an effect on program
> behavior.


Are you compiling a 32bit binary? Because in that case you would be 
using the digital mars c runtime which might have an implementation for 
fwide.


--
Mike Wey


Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-06 Thread pineapple via Digitalmars-d-learn

On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
> > I'm not sure if this works quite as intended, but I was at
> > least able to produce a UTF-16 decode error rather than a
> > UTF-8 decode error by setting the file orientation before
> > reading it.
> >
> > import std.stdio;
> > import core.stdc.wchar_ : fwide;
> > void main(){
> >     auto file = File("UTF-16LE encoded file.txt");
> >     fwide(file.getFP(), 1);
> >     foreach(line; file.byLine){
> >         writeln(file.readln);
> >     }
> > }
>
> fwide is not implemented in Windows:
> https://msdn.microsoft.com/en-us/library/aa985619.aspx


That's odd. It was on Windows 7 64-bit that I put together and 
tested that example, and calling fwide definitely had an effect 
on program behavior.


Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-05 Thread rumbu via Digitalmars-d-learn


> I'm not sure if this works quite as intended, but I was at
> least able to produce a UTF-16 decode error rather than a UTF-8
> decode error by setting the file orientation before reading it.
>
> import std.stdio;
> import core.stdc.wchar_ : fwide;
> void main(){
>     auto file = File("UTF-16LE encoded file.txt");
>     fwide(file.getFP(), 1);
>     foreach(line; file.byLine){
>         writeln(file.readln);
>     }
> }


fwide is not implemented in Windows: 
https://msdn.microsoft.com/en-us/library/aa985619.aspx





Re: Parsing a UTF-16LE file line by line?

2017-01-05 Thread Steven Schveighoffer via Digitalmars-d-learn

On 1/4/17 6:03 AM, Nestor wrote:
> Hi,
>
> I was just trying to parse a UTF-16LE file using byLine, but apparently
> this function doesn't work with anything other than UTF-8, because I get
> this error:
>
> "Invalid UTF-8 sequence (at index 1)"
>
> How can I achieve what I want, without loading the entire file into memory?
>
> Thanks in advance.


I have not tested much with UTF16 and std.stdio, but I don't believe the 
underlying FILE * being used by phobos has good support for it.


In my testing, for instance, byLine with a non-ASCII delimiter didn't 
work at all.


On Windows 64-bit, MSVC simply ignores any attempts to change the width 
of the stream.


I wouldn't hold out much hope for this to be fixed.

-Steve


Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-04 Thread Daniel Kozák via Digitalmars-d-learn
Nestor via Digitalmars-d-learn wrote on Wed, Jan 4, 2017 at 8:20:
> On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
> > Ok, I've done some testing and you are right, byLine is broken, so
> > please file a bug
>
> A bug? I was under the impression that this function was *intended*
> to work only with UTF-8 encoded files.


An impression is nice, but the documentation says nothing about it, so 
anyone who reads the docs will expect it to work with any encoding. 
And from the docs I see there is a way to select the encoding and even 
the Terminator and its type, and this does not work, so I expect it 
is a bug.


Another weird behaviour: when you read a file as wstring, it will try to 
decode it as UTF-8 and then encode it to UTF-16. Even if that works (for 
UTF-8 files) and you end up with wstring lines (wstring[]), when you try 
to save them they are automatically saved as UTF-8. WTF, this is really 
wrong, and if it is intended it should be documented better. Right now 
it is really hard to work with dlang stdio.


But I hope it will be deprecated someday and replaced with something 
that supports ranges and async I/O.
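
On the point about wstring lines being written out as UTF-8: one way around it is to build the UTF-16LE bytes explicitly and write them with rawWrite. The sketch below is only an illustration; encodeUtf16le and the output file name are made up for this example.

```d
import std.stdio : File;

/// Encodes UTF-8 text as UTF-16LE bytes (no BOM is added).
ubyte[] encodeUtf16le(const(char)[] text)
{
    ubyte[] bytes;
    bytes.reserve(text.length * 2);
    // foreach with a wchar loop variable transcodes UTF-8 to UTF-16 code
    // units, producing surrogate pairs where needed.
    foreach (wchar u; text)
    {
        bytes ~= cast(ubyte)(u & 0xFF); // low byte first: little-endian
        bytes ~= cast(ubyte)(u >> 8);
    }
    return bytes;
}

void main()
{
    // Hypothetical output file; "wb" avoids newline translation on Windows.
    auto f = File("out-utf16le.txt", "wb");
    f.rawWrite(encodeUtf16le("\uFEFF"));          // byte order mark
    f.rawWrite(encodeUtf16le("first line\r\n"));
}
```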


Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-04 Thread pineapple via Digitalmars-d-learn

On Wednesday, 4 January 2017 at 19:20:31 UTC, Nestor wrote:
> On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
> > Ok, I've done some testing and you are right, byLine is broken,
> > so please file a bug
>
> A bug? I was under the impression that this function was
> *intended* to work only with UTF-8 encoded files.


I'm not sure if this works quite as intended, but I was at least 
able to produce a UTF-16 decode error rather than a UTF-8 decode 
error by setting the file orientation before reading it.


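import std.stdio;
import core.stdc.wchar_ : fwide;
void main(){
    auto file = File("UTF-16LE encoded file.txt");
    fwide(file.getFP(), 1);  // request wide orientation on the C stream
    foreach(line; file.byLine){
        // NOTE: this prints the result of a second read (readln), not `line`
        writeln(file.readln);
    }
}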


Re: Parsing a UTF-16LE file line by line, BUG?

2017-01-04 Thread Nestor via Digitalmars-d-learn

On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
> Ok, I've done some testing and you are right, byLine is broken,
> so please file a bug


A bug? I was under the impression that this function was 
*intended* to work only with UTF-8 encoded files.


Re: Parsing a UTF-16LE file line by line?

2017-01-04 Thread Daniel Kozák via Digitalmars-d-learn

Daniel Kozák wrote on Wed, Jan 4, 2017 at 6:33:
> Nestor via Digitalmars-d-learn wrote on Wed, Jan 4, 2017 at 12:03:
> > Hi,
> >
> > I was just trying to parse a UTF-16LE file using byLine, but
> > apparently this function doesn't work with anything other than
> > UTF-8, because I get this error:
> >
> > "Invalid UTF-8 sequence (at index 1)"
> >
> > How can I achieve what I want, without loading the entire file into
> > memory?
> >
> > Thanks in advance.
>
> Can you show your code? byLine should work OK. Please also post an
> example of a UTF-16LE file which does not work.

OK, I've done some testing and you are right, byLine is broken, so 
please file a bug.





Re: Parsing a UTF-16LE file line by line?

2017-01-04 Thread Daniel Kozák via Digitalmars-d-learn


Nestor via Digitalmars-d-learn wrote on Wed, Jan 4, 2017 at 12:03:
> Hi,
>
> I was just trying to parse a UTF-16LE file using byLine, but
> apparently this function doesn't work with anything other than UTF-8,
> because I get this error:
>
> "Invalid UTF-8 sequence (at index 1)"
>
> How can I achieve what I want, without loading the entire file into
> memory?
>
> Thanks in advance.

Can you show your code? byLine should work OK. Please also post an 
example of a UTF-16LE file which does not work.