Using lazy code to process large files

2017-08-02 Thread Martin Drašar via Digitalmars-d-learn
Hi,

I am struggling to use lazy, range-based code to process large text
files. My task is simple, i.e., I could write non-range-based code for it
in a really short time, but I wanted to try a different approach and I
keep hitting one wall after another.

Task: read a CSV-like input, take only lines starting with some string,
split them by commas, remove leading and trailing whitespace from the
split elements, join them by commas again and write the result to an output.

My attempt so far:

alias stringStripLeft = std.string.stripLeft;

auto input  = File("input.csv");
auto output = File("output.csv");

auto result = input.byLine()
   .filter!(a => a.startsWith("..."))
   .map!(a => a.splitter(","))
   .stringStripleft // <-- errors start here
   .join(",");

output.write(result);

Needless to say, this does not compile. Basically, I don't know how to
feed MapResults to splitter and then sensibly join it.

Thank you for any hint.
Martin


Re: Using lazy code to process large files

2017-08-02 Thread Daniel Kozak via Digitalmars-d-learn
import std.stdio;
import std.algorithm;

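void main()
{
    // In-memory stand-in for the lines of the input file.
    auto input  = ["... some text, another text", "some,another","...so,an"];

    auto result = input.filter!(a => a.startsWith("..."))
        // Strip leading spaces from every field of every line...
        .map!(a=>a.splitter(",").map!(a=>a.stripLeft(' ')))
        // ...and lazily re-join the fields of each line.
        .map!(a=>a.joiner(","));
    writeln(result);
}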




Re: Using lazy code to process large files

2017-08-02 Thread Martin Drašar via Digitalmars-d-learn


Thanks a lot. That did the trick.

Martin



Re: Using lazy code to process large files

2017-08-02 Thread Steven Schveighoffer via Digitalmars-d-learn


The problem is that you are 2 ranges deep when you apply splitter. The 
result of the map is a range of ranges.


Then when you apply stringStripleft, you are applying to the map result, 
not the splitter result.


What you need is to bury the action on each string into the map:

.map!(a => a.splitter(",").map!(stringStripLeft).join(","))

The internal map is because stripLeft doesn't take a range of strings 
(the result of splitter), it takes a range of dchar (which is each 
element of splitter). So you use map to apply the function to every element.


Disclaimer: I haven't tested to see this works, but I think it should.

Note that I have forwarded your call to join, even though this actually 
is not lazy, it builds a string out of it (and actually probably a 
dstring). Use joiner to do it truly lazily.


I will also note that the result is not going to look like what you 
think, as outputting a range looks like this: [element, element, 
element, ...]


You could potentially output like this:

output.write(result.joiner("\n"));

Which I think will work. Again, no testing.

I wouldn't expect good performance from this, as there is auto-decoding 
all over the place.


-Steve
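
A minimal sketch of the approach described above, using joiner instead of join so that nothing is built eagerly. The helper name cleanLine is hypothetical, the "..." prefix is the placeholder from the original question, and std.string.strip is used (rather than stripLeft) on the assumption that both leading and trailing whitespace should go, as the task states. It reads stdin and writes stdout so it runs without any particular file being present:

```d
import std.algorithm : filter, joiner, map, splitter, startsWith;
import std.stdio : stdin, stdout;
import std.string : strip;

// Hypothetical helper: lazily trims every comma-separated field of one
// line and re-joins the fields with commas. The result is a lazy range
// of characters, not a string.
auto cleanLine(const(char)[] line)
{
    return line.splitter(",")
               .map!(field => field.strip)
               .joiner(",");
}

void main()
{
    // byLine reuses its buffer, so each line is written out before the
    // next one is read.
    foreach (line; stdin.byLine.filter!(a => a.startsWith("...")))
        stdout.writeln(cleanLine(line));
}
```

It could be run as "./app < input.csv > output.csv". Writing line by line avoids the [element, element, ...] formatting that printing a range of ranges would produce.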


Re: Using lazy code to process large files

2017-08-02 Thread kdevel via Digitalmars-d-learn

On Wednesday, 2 August 2017 at 11:44:30 UTC, Martin Drašar wrote:

> Thank you for any hint.


import std.stdio;
import std.string;
import std.algorithm;
import std.conv;

void main ()
{
   auto input  = File("input.csv");

   auto result = input.byLine()
      .filter!(a => a.startsWith("..."))
      .map!(a => a.splitter(",")
         .map!(b => b.stripLeft)
         .join(","))
      .join("\n");

   auto output = File("output.csv", "w");
   output.write(result);
}



Re: Using lazy code to process large files

2017-08-02 Thread Martin Drašar via Digitalmars-d-learn

Thanks Steven for the explanation. Just to clarify - what would be
needed to avoid auto-decoding in this case? Process it all as arrays,
using byChunk to read it, etc?

@kdevel: Thank you for your solution as well.

Martin


Re: Using lazy code to process large files

2017-08-02 Thread Daniel Kozak via Digitalmars-d-learn
using http://dlang.org/phobos/std_utf.html#byCodeUnit could help



Re: Using lazy code to process large files

2017-08-02 Thread Daniel Kozak via Digitalmars-d-learn
something like file.byLine.map!(a=>a.byCodeUnit)



Re: Using lazy code to process large files

2017-08-02 Thread Steven Schveighoffer via Digitalmars-d-learn

On 8/2/17 8:59 AM, Martin Drašar via Digitalmars-d-learn wrote:

> Thanks Steven for the explanation. Just to clarify - what would be
> needed to avoid auto-decoding in this case? Process it all as arrays,
> using byChunk to read it, etc?


As Daniel said, using byCodeUnit will help.

I don't know what the result of this is when outputting, however. I'd be 
concerned it just integer promoted the data to dchars before outputting. 
If your file data is all ASCII it should work fine. You'd have to 
experiment to see how it works.


-Steve
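
A minimal sketch of what a byCodeUnit-based version of the per-line step might look like. The helper name cleanLineNoDecode is hypothetical, and the sketch assumes that only leading ASCII spaces need to be stripped (std.algorithm.stripLeft with a single element, not the Unicode-aware std.string.stripLeft). The separator passed to joiner is also wrapped in byCodeUnit so that its element type matches the fields:

```d
import std.algorithm : equal, joiner, map, splitter, stripLeft;
import std.utf : byCodeUnit;

// Hypothetical helper: splits, strips and re-joins on code units, so the
// range primitives never decode to dchar.
auto cleanLineNoDecode(const(char)[] line)
{
    return line.byCodeUnit
               .splitter(',')
               .map!(field => field.stripLeft(' '))
               .joiner(",".byCodeUnit);
}

void main()
{
    auto cleaned = cleanLineNoDecode("... some text,  another text");

    // The elements are code units, not decoded code points.
    static assert(!is(typeof(cleaned.front) == dchar));

    assert(cleaned.equal("... some text,another text".byCodeUnit));
}
```

For ASCII-only input this should behave like the decoding version; how non-ASCII data looks when written out is something to experiment with, as noted above.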


Re: Using lazy code to process large files

2017-08-02 Thread kdevel via Digitalmars-d-learn
On Wednesday, 2 August 2017 at 13:45:01 UTC, Steven Schveighoffer wrote:

> As Daniel said, using byCodeUnit will help.

stripLeft seems to autodecode even when fed with CodeUnits. How
do I prevent this?

  1 void main ()
  2 {
  3    import std.stdio;
  4    import std.string;
  5    import std.conv;
  6    import std.utf;
  7    import std.algorithm;
  8
  9    string [] src = [ " \xfc" ]; // blank + latin-1 encoded u umlaut
 10    auto result = src
 11       .map!(a => a.byCodeUnit)
 12       .map!(a => a.stripLeft);
 13    result.writeln;
 14 }

Crashes with a C++-like dump.



Re: Using lazy code to process large files

2017-08-02 Thread Steven Schveighoffer via Digitalmars-d-learn

On 8/2/17 11:02 AM, kdevel wrote:
> On Wednesday, 2 August 2017 at 13:45:01 UTC, Steven Schveighoffer wrote:
>
> > As Daniel said, using byCodeUnit will help.
>
> stripLeft seems to autodecode even when fed with CodeUnits. How do I
> prevent this?
>
>   1 void main ()
>   2 {
>   3    import std.stdio;
>   4    import std.string;
>   5    import std.conv;
>   6    import std.utf;
>   7    import std.algorithm;
>   8
>   9    string [] src = [ " \xfc" ]; // blank + latin-1 encoded u umlaut
>  10    auto result = src
>  11       .map!(a => a.byCodeUnit)
>  12       .map!(a => a.stripLeft);
>  13    result.writeln;
>  14 }
>
> Crashes with a C++-like dump.



First, as a tip, please post either a link to a paste site, or don't put 
the line numbers. It's much easier to copy-paste your code into an 
editor if you don't have the line numbers.


What has happened is that you injected a non-encoded code point. In 
UTF8, any code point above 0x7f must be encoded into a string of several 
code units. See the table on this page: https://en.wikipedia.org/wiki/%C3%9C


If we use the correct code unit sequence (0xc3 0x9c), then it works: 
https://run.dlang.io/is/4umQoo


-Steve
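
A small sketch illustrating the point about encoding, under the assumption that std.utf.validate is an acceptable way to check a string. The helper name isValidUtf8 is hypothetical. It shows that U+00FC occupies two code units in UTF-8, while a lone 0xFC byte is rejected:

```d
import std.string : representation;
import std.utf : UTFException, validate;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "\u00fc" lets the compiler emit the proper UTF-8 encoding.
    string encoded = "\u00fc";
    const ubyte[] expected = [0xC3, 0xBC];
    assert(encoded.representation == expected);
    assert(isValidUtf8(encoded));

    // "\xfc" injects the raw byte 0xFC, which is Latin-1, not UTF-8.
    string raw = " \xfc";
    assert(raw.length == 2);
    assert(!isValidUtf8(raw));
}
```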


Re: Using lazy code to process large files

2017-08-02 Thread kdevel via Digitalmars-d-learn
On Wednesday, 2 August 2017 at 15:52:13 UTC, Steven Schveighoffer wrote:

> [...]
>
> First, as a tip, please post either a link to a paste site, or
> don't put the line numbers. It's much easier to copy-paste your
> code into an editor if you don't have the line numbers.

With pleasure.

> [...]
>
> If we use the correct code unit sequence (0xc3 0x9c), then [...]

If I avoid std.string.stripLeft and use std.algorithm.stripLeft(' ')
instead it works as expected:

void main ()
{
   import std.stdio;
   import std.utf;
   import std.algorithm;

   string [] src = [ " \xfc" ]; // blank + latin-1 encoded u umlaut

   auto result = src
      .map!(a => a.byCodeUnit)
      .map!(a => a.stripLeft(' '));
   result.writeln;
}




Re: Using lazy code to process large files

2017-08-02 Thread Steven Schveighoffer via Digitalmars-d-learn

On 8/2/17 1:16 PM, kdevel wrote:
> On Wednesday, 2 August 2017 at 15:52:13 UTC, Steven Schveighoffer wrote:
>
> > If we use the correct code unit sequence (0xc3 0x9c), then [...]
>
> If I avoid std.string.stripLeft and use std.algorithm.stripLeft(' ')
> instead it works as expected:


What is expected? What I see on the screen when I run my code is:

[Ü]

What I see when I run your "working" code is:

[?]

You are missing the point that your input string is invalid.

std.algorithm is not validating the entire string, and so it doesn't 
throw an error like string.stripLeft does. writeln doesn't do any 
decoding of individual strings. It avoids the problem and just copies 
your bad data directly.


If you fix the input, both will work correctly.

-Steve


Re: Using lazy code to process large files

2017-08-02 Thread kdevel via Digitalmars-d-learn
On Wednesday, 2 August 2017 at 17:37:09 UTC, Steven Schveighoffer wrote:

> What is expected? What I see on the screen when I run my code is:
>
> [Ü]

Upper case?

> What I see when I run your "working" code is:
>
> [?]

Your terminal is incapable of rendering the Latin-1 encoding. The
program prints one byte of value 0xfc. You may pipe the output
into hexdump -C:

  5b fc 5d 0a   |[ü].|
0004

> You are missing the point that your input string is invalid.

It's perfectly okay to put any value a octet can take into an
octet. I did not claim that the data in the string memory is
syntactically valid UTF-8. Read the comment in line 9 of my post
of 15:02:22.

> std.algorithm is not validating the entire string,

True and it should not. So this is what I want.

> and so it doesn't throw an error like string.stripLeft does.

That is the point. You wrote

| I wouldn't expect good performance from this, as there is auto-decoding all
| over the place.

I erroneously thought that using byCodeUnit disables the whole
UTF-8 processing and enforces operation on (u)bytes. But this is
not the case at least not for stripLeft and probably other string
functions.

> writeln doesn't do any decoding of individual strings. It
> avoids the problem and just copies your bad data directly.

That is what I expected.




Re: Using lazy code to process large files

2017-08-02 Thread ag0aep6g via Digitalmars-d-learn

On 08/02/2017 08:28 PM, kdevel wrote:
> It's perfectly okay to put any value a octet can take into an octet. I
> did not claim that the data in the string memory is syntactically valid
> UTF-8. Read the comment in line 9 of my post of 15:02:22.


You're claiming that the data is in UTF-8 when you use `string` as the 
type. For arbitrary octets, use something like `ubyte[]`.
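
A minimal sketch of that suggestion, assuming the input really is Latin-1. The data is kept as ubyte[] while it is treated as raw octets, and converted to UTF-8 with std.encoding.transcode only when it is needed as text. The helper name latin1ToUtf8 is hypothetical:

```d
import std.algorithm : stripLeft;
import std.encoding : Latin1String, transcode;

// Hypothetical helper: reinterprets raw bytes as Latin-1 and converts
// them to a UTF-8 string.
string latin1ToUtf8(immutable(ubyte)[] bytes)
{
    string utf8;
    transcode(cast(Latin1String) bytes, utf8);
    return utf8;
}

void main()
{
    // Latin-1 input: blank + u umlaut. As ubyte[] it makes no claim of
    // being UTF-8.
    immutable(ubyte)[] raw = [0x20, 0xFC];

    // Element-wise strip: a plain byte comparison, no decoding involved.
    auto stripped = raw.stripLeft(cast(ubyte) ' ');
    assert(stripped.length == 1 && stripped[0] == 0xFC);

    // Convert once text semantics are required.
    assert(latin1ToUtf8(stripped) == "\u00fc");
}
```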


Re: Using lazy code to process large files

2017-08-02 Thread Steven Schveighoffer via Digitalmars-d-learn

On 8/2/17 2:28 PM, kdevel wrote:
> On Wednesday, 2 August 2017 at 17:37:09 UTC, Steven Schveighoffer wrote:
>
> > What is expected? What I see on the screen when I run my code is:
> >
> > [Ü]
>
> Upper case?

Sorry, should be c3 bc, not c3 9c. I misread the table on that wikipedia
entry.

> > What I see when I run your "working" code is:
> >
> > [?]
>
> Your terminal is incapable of rendering the Latin-1 encoding. The
> program prints one byte of value 0xfc. You may pipe the output into
> hexdump -C:
>
>   5b fc 5d 0a   |[ü].|
> 0004

Right, I saw that. But it's still not valid utf8, which is what char and
string are.

> > You are missing the point that your input string is invalid.
>
> It's perfectly okay to put any value a octet can take into an octet. I
> did not claim that the data in the string memory is syntactically valid
> UTF-8. Read the comment in line 9 of my post of 15:02:22.

Except a string is utf8, period. char is a utf8 code unit, period.

If you want some other encoding, it has to be defined as a different
type. Otherwise, you will get errors when using any D library, all of
which should expect char to be a utf8 code-unit.

> > std.algorithm is not validating the entire string,
>
> True and it should not. So this is what I want.

But it's not the same as the original. For instance, the original would
strip tabs, yours does not.

> > and so it doesn't throw an error like string.stripLeft does.
>
> That is the point. You wrote
>
> | I wouldn't expect good performance from this, as there is auto-decoding all
> | over the place.
>
> I erroneously thought that using byCodeUnit disables the whole UTF-8
> processing and enforces operation on (u)bytes. But this is not the case
> at least not for stripLeft and probably other string functions.

std.string.stripLeft is still expecting unicode, as it's testing
std.uni.isWhite. So it has to do decoding. std.algorithm.stripLeft (the
way you called it anyway) is looking at char instances and doing a
direct comparison to ONE char (' '), so it can be much much faster and
does not have to decode. This is an optimization, not a feature. I
wouldn't be surprised, for instance, if byCodeUnit threw an error when
encountering an invalid sequence in debug mode or something.

If your goal is to only look for that ascii character, then using
byCodeUnit is required to avoid auto-decoding, which is where the
unexpected slowdown would come.

But string functions that are specifically looking for unicode sequences
are still going to decode, even if the range isn't doing it proactively.

In any case, the input data is not valid, you should use ubyte[], or
some other type array, not strings.


-Steve