Re: std.csv Performance Review

2017-06-05 Thread H. S. Teoh via Digitalmars-d
On Sun, Jun 04, 2017 at 03:59:03PM +, Jesse Phillips via Digitalmars-d 
wrote:
[...]
> Ok, I took you up on that; I'm still skeptical:
> 
> LDC2 -O3 -release -enable-cross-module-inlining
> 
> std.csv: 12487 msecs
> fastcsv (no gc): 1376 msecs
> csvslicing: 3039 msecs
> 
> That looks like about 10 times faster to me. Using the slicing version
> failed because of \r\n line endings (I guess multi-part separator
> handling is broken), so I changed the data file to get the execution
> time.

Thank you for testing it yourself.  I also tried to run the tests again
on my machine, and I can't seem to reproduce the 102136 msecs reading
again.  It seems that different compilers give somewhat different
readings, and we are also using different compile flags.  In any case,
in the spirit of full disclosure, here are my tests with the 3
compilers, run just now to be sure I'm not copying old bad measurements:


$ dmd -O -inline benchmark.d fastcsv.d
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 33119 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 2596 msecs

$ gdc -O3 -finline benchmark.d fastcsv.d -o benchmark
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 23103 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 1909 msecs

$ ldc2 -O3 benchmark.d fastcsv.d 
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 20776 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 1813 msecs


So, it looks like your 10x figure is more-or-less on target.  I've no
idea where the original 102136 msecs reading came from.  Perhaps that
was an unfortunate coincidence with a heavy load on my machine, or
something like that. Or maybe just a bungled copy-n-paste.


> Anyway, I'm not trying to claim fastcsv isn't good at what it does,
> all I'm trying to point out is std.csv is doing more work than these
> faster csv parsers. And I don't even want to claim that std.csv is
> better because of that work, it actually appears that it was a mistake
> to do validation.

I never intended fastcsv to become a point of contention or some kind
of competition with std.csv, and I apologize if I ever came across that
way.  I fully understand that std.csv does more work than fastcsv;
certainly, being able to assume an in-memory input and free slicing
gives a big advantage over being restricted to just input range
primitives. I had hoped to actually work fastcsv into a suitable form to
merge into std.csv -- to dispel wrong perceptions of D being "slow", you
see -- but it turned out to be more work than I had time for, so I
didn't get very far beyond the initial promising results.

My original hope was that the fastcsv code would be taken as a source of
ideas for speeding up std.csv, rather than as "haha I wrote faster code
than std.csv, so std.csv sux".  The latter was not my intention at all.

Anyway, I'm glad that you're looking into using slicing in std.csv. We
need Phobos modules to be highly performant so that newcomers don't get
the wrong impression about the language being slow. I'd also recommend
investigating reducing GC load, as I described in my previous post, as
another angle for improving the performance of std.csv.

As for whether to validate or not: if you were to ask me, I'd leave it
in, with a caveat in the docs that it would be less performant. As the
standard library, Phobos should give the user options, including the
option to validate input files that could potentially be malformed.  But
where the user knows the input is always well-formed, we can (and
should) take advantage of that to achieve better performance.


T

-- 
Why waste time reinventing the wheel, when you could be reinventing the engine? 
-- Damian Conway


Re: std.csv Performance Review

2017-06-05 Thread Jon Degenhardt via Digitalmars-d

On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:

Author here:

The discussion[1] and articles[2] around "Faster Command Line 
Tools" had me trying out std.csv for the task.


[snip]

Keep in mind that the file itself has 10,512,769 rows of data 
with four columns. Now I've talked about std.csv's performance in 
the past, probably with the author of the fast command line 
tools.


[snip]


I'm the author of the TSV tools. I'd be happy to provide insights 
I've gained from that exercise to help improve std.csv. I did 
examine std.csv when I first started writing the TSV tools, but 
decided it wasn't appropriate for what I was trying to do. I 
don't remember the specifics at this point though. CSV imposes 
many requirements on an implementation that TSV does not, which 
makes it challenging to produce an implementation optimal for 
both.


--Jon



Re: std.csv Performance Review

2017-06-04 Thread Seb via Digitalmars-d

On Sunday, 4 June 2017 at 15:59:03 UTC, Jesse Phillips wrote:

On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:

[...]


Ok, I took you up on that; I'm still skeptical:

LDC2 -O3 -release -enable-cross-module-inlining

std.csv: 12487 msecs
fastcsv (no gc): 1376 msecs
csvslicing: 3039 msecs

That looks like about 10 times faster to me. Using the slicing 
version failed because of \r\n line endings (I guess multi-part 
separator handling is broken), so I changed the data file to get 
the execution time.


Anyway, I'm not trying to claim fastcsv isn't good at what it 
does; all I'm trying to point out is that std.csv is doing more 
work than these faster CSV parsers. And I don't even want to 
claim that std.csv is better because of that work; it actually 
appears that it was a mistake to do validation.


In case you have time, it would be very interesting to compare it 
with other state-of-the-art tools like paratext:


http://www.wise.io/tech/paratext


Re: std.csv Performance Review

2017-06-04 Thread Jesse Phillips via Digitalmars-d

On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
On Sun, Jun 04, 2017 at 05:41:10AM +, Jesse Phillips via 
Digitalmars-d wrote:

On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
> Do you know what happened with fastcsv [0]? Original thread: 
> [1].
> 
> [0] https://github.com/quickfur/fastcsv
> [1] 
> http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-le...@puremagic.com


I do not. Rereading that in light of this new article, I'm a 
little skeptical of the 51 times faster figure, since I'm seeing 
only 10x against these other implementations.

[...]

You don't have to be skeptical, nor do you have to believe 
what I claimed.  I posted the entire code I used in the 
original thread, as well as the URLs of the exact data files I 
used for testing.  You can just run it yourself and see the 
results for yourself.


Ok, I took you up on that; I'm still skeptical:

LDC2 -O3 -release -enable-cross-module-inlining

std.csv: 12487 msecs
fastcsv (no gc): 1376 msecs
csvslicing: 3039 msecs

That looks like about 10 times faster to me. Using the slicing 
version failed because of \r\n line endings (I guess multi-part 
separator handling is broken), so I changed the data file to get 
the execution time.


Anyway, I'm not trying to claim fastcsv isn't good at what it 
does; all I'm trying to point out is that std.csv is doing more 
work than these faster CSV parsers. And I don't even want to 
claim that std.csv is better because of that work; it actually 
appears that it was a mistake to do validation.


Re: std.csv Performance Review

2017-06-04 Thread Anonymouse via Digitalmars-d

On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
Even though the issue can be ignored, the overhead of parsing 
to identify issues still remains. I haven't attempted to write 
the algorithm assuming properly structured data, so I don't know 
what the performance would look like, but I suspect it isn't 
negligible. There is also likely some overhead for providing 
the tokens through range interfaces.


The immediate idea is to have the cake and eat it too with 
parseXML!(Yes.validate)(...), but that's more work.
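
For illustration, here is a minimal sketch of that compile-time flag 
idea in D's std.typecons Flag style. parseCsv and its body are 
hypothetical stand-ins, not std.csv's actual API:

import std.typecons : Flag, Yes, No;

// Hypothetical parser entry point: validation is a template parameter,
// so the checking code compiles away entirely when the caller opts out.
auto parseCsv(Flag!"validate" validate = Yes.validate, Range)(Range input)
{
    static if (validate)
    {
        // full well-formedness checks: quote balancing, column counts, ...
    }
    else
    {
        // assume well-formed input; go straight to tokenising
    }
    return input; // placeholder: a real parser would return a record range
}

// usage: parseCsv!(No.validate)(text) for trusted, well-formed data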


Re: std.csv Performance Review

2017-06-04 Thread Patrick Schluter via Digitalmars-d

On Sunday, 4 June 2017 at 06:54:46 UTC, Patrick Schluter wrote:

On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
On Sun, Jun 04, 2017 at 05:41:10AM +, Jesse Phillips via 
Digitalmars-d wrote: [...]
(Note that this is much less of a limitation than it seems; 
for example you could use std.mmfile to memory-map the file 
into your address space so that it doesn't actually have to 
fit into memory, and you can still take slices of it. The OS 
will manage the paging from/to disk for you. Of course, it 
will be slower when something has to be paged from disk, but 
IME this is often much faster than if you read the data into 
memory yourself.


If the file is in the file cache of the kernel, memory mapping 
does not need to reload the file as it is already in memory. In 
fact, calling mmap() changes only the sharing of the pages in 
general. That's where most of the performance win from memory 
mapping comes from.
To be precise, it's the copying of data that is spared by mmap. 
If the file is in the file cache, the open/read/write/close 
syscalls will also be fed from the memory-mapped cache entry, but 
this requires that the data be copied from kernel memory space to 
the process's buffer space, so each call to read() has to do this 
copying. The gain from mmap thus comes from avoiding that copy 
and avoiding the read/write/seek syscalls. The loading of the 
physical file into memory is the same in both cases.
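
In D terms, the contrast looks like this (a sketch; the function 
names are just for illustration):

import std.mmfile : MmFile;
import std.stdio : File;

void readCopies(string path)
{
    auto f = File(path, "rb");
    auto buf = new ubyte[64 * 1024];
    auto got = f.rawRead(buf); // rawRead copies page-cache data into buf
    // ... process got ...
}

void mmapShares(string path)
{
    auto mmf = new MmFile(path);
    auto data = cast(const(ubyte)[]) mmf[]; // page-cache pages mapped in place
    // ... process data: no per-read copy, no read/write/seek syscalls ...
}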





This stackoverflow [1] discussion links to a realworldtech 
discussion with Linus Torvalds explaining it in detail. On 
Windows and Solaris the mechanism is the same.


[1] 
https://stackoverflow.com/questions/5902629/mmap-msync-and-linux-process-termination/6219962#6219962





Re: std.csv Performance Review

2017-06-04 Thread Patrick Schluter via Digitalmars-d

On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
On Sun, Jun 04, 2017 at 05:41:10AM +, Jesse Phillips via 
Digitalmars-d wrote: [...]
(Note that this is much less of a limitation than it seems; for 
example you could use std.mmfile to memory-map the file into 
your address space so that it doesn't actually have to fit into 
memory, and you can still take slices of it. The OS will manage 
the paging from/to disk for you. Of course, it will be slower 
when something has to be paged from disk, but IME this is often 
much faster than if you read the data into memory yourself.


If the file is in the file cache of the kernel, memory mapping 
does not need to reload the file as it is already in memory. In 
fact, calling mmap() changes only the sharing of the pages in 
general. That's where most of the performance win from memory 
mapping comes from.


This stackoverflow [1] discussion links to a realworldtech 
discussion with Linus Torvalds explaining it in detail. On 
Windows and Solaris the mechanism is the same.


[1] 
https://stackoverflow.com/questions/5902629/mmap-msync-and-linux-process-termination/6219962#6219962







Re: std.csv Performance Review

2017-06-04 Thread H. S. Teoh via Digitalmars-d
On Sun, Jun 04, 2017 at 05:41:10AM +, Jesse Phillips via Digitalmars-d 
wrote:
> On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
> > Do you know what happened with fastcsv [0]? Original thread: [1].
> > 
> > [0] https://github.com/quickfur/fastcsv
> > [1] 
> > http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-le...@puremagic.com
> 
> I do not. Rereading that in light of this new article, I'm a little
> skeptical of the 51 times faster figure, since I'm seeing only 10x
> against these other implementations.
[...]

You don't have to be skeptical, nor do you have to believe what I
claimed.  I posted the entire code I used in the original thread, as
well as the URLs of the exact data files I used for testing.  You can
just run it yourself and see the results for yourself.

And yes, fastcsv has its limitations (the file has to fit in memory, no
validation is done, etc.), which are also documented up-front in the
README file.  I wrote the code targeting a specific use case mentioned
by the OP of the original thread, so I do not expect or claim you will
see the same kind of performance for other use cases. If you want
validation, then it's a given that you won't get maximum performance,
simply because there's just more work to do.  For data sets that don't
fit into memory, I already have some ideas about how to extend my
algorithm to work with them, so some of the performance may still be
retained. But obviously it's not going to be as fast as if you can just
read the entire file into memory first.

(Note that this is much less of a limitation than it seems; for example
you could use std.mmfile to memory-map the file into your address space
so that it doesn't actually have to fit into memory, and you can still
take slices of it. The OS will manage the paging from/to disk for you.
Of course, it will be slower when something has to be paged from disk,
but IME this is often much faster than if you read the data into memory
yourself. Again, you don't have to believe me: the fastcsv code is
there, just import std.mmfile, mmap the largest CSV file you can find,
call fastcsv on it, and measure the performance yourself. If your code
performs better, great, tell us all about it. I'm interested to learn
how you did it.)
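
To make that experiment concrete, a minimal harness might look like 
the sketch below; csvToSlices is a hypothetical stand-in for fastcsv's 
actual entry point:

import std.datetime.stopwatch : AutoStart, StopWatch;
import std.mmfile : MmFile;
import std.stdio : writefln;

void main(string[] args)
{
    auto mmf = new MmFile(args[1]);        // the file need not fit in RAM
    auto data = cast(const(char)[]) mmf[]; // slices of this stay valid views

    auto sw = StopWatch(AutoStart.yes);
    // hypothetical call; substitute fastcsv's actual parsing function:
    // auto records = csvToSlices(data);
    sw.stop();
    writefln("parsed in %s msecs", sw.peek.total!"msecs");
}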

Note that besides slicing, another major part of the performance boost
in fastcsv is in minimizing GC allocations.  If you allocate a string
for each field in a row, it will be much slower than if you either
sliced the original string, or if you allocated a large buffer for
holding the data and just take slices for each field.  Furthermore, if
you allocate a new array per row to hold the list of fields, it will be
much slower than if you allocate a large array for holding all the
fields of all the rows, and merely slice this array to get your rows.
Of course, you cannot know ahead of time exactly how many rows there
will be, so the next best thing is to allocate a series of large arrays,
capable of holding the field slices of k rows, for sufficiently large k.
Once the current array runs out of space, copy the (partial) slices of
the last row to the beginning of a new large array, and continue from
there.

This way, you will be making n/k allocations, where n is the number of
rows and k is the number of rows that fit into each buffer, as opposed
to n allocations. For large values of k, this greatly reduces the GC
load and significantly speeds things up.  Again, don't take my word for
it. Run a profiler on the fastcsv code and see for yourself.
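
A rough sketch of that chunked-buffer scheme (the names and the chunk
size k are illustrative, not fastcsv's actual code):

alias Field = const(char)[];

// One GC allocation per k fields instead of one array per row.
// Assumes no single row has more than k fields.
struct FieldChunks
{
    enum size_t k = 1 << 16;
    Field[] buf;
    size_t rowStart;  // index where the current row began
    size_t pos;       // next free slot

    void putField(Field f)
    {
        if (pos == buf.length)  // chunk full (or not yet allocated)
        {
            auto fresh = new Field[k];           // the one big allocation
            auto partial = buf[rowStart .. pos]; // carry the partial row over
            fresh[0 .. partial.length] = partial[];
            pos = partial.length;
            rowStart = 0;
            buf = fresh;
        }
        buf[pos++] = f;
    }

    Field[] endRow()  // a finished row is itself just a slice of the chunk
    {
        auto row = buf[rowStart .. pos];
        rowStart = pos;
        return row;
    }
}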


T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG


Re: std.csv Performance Review

2017-06-03 Thread Jesse Phillips via Digitalmars-d

On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:

Do you know what happened with fastcsv [0]? Original thread: [1].

[0] https://github.com/quickfur/fastcsv
[1] 
http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-le...@puremagic.com


I do not. Rereading that in light of this new article, I'm a 
little skeptical of the 51 times faster figure, since I'm seeing 
only 10x against these other implementations.


In other news, I'm not sure the validation std.csv is doing is 
worth it.


Re: std.csv Performance Review

2017-06-03 Thread bachmeier via Digitalmars-d

On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:

Author here:

The discussion[1] and articles[2] around "Faster Command Line 
Tools" had me trying out std.csv for the task.


Now I know std.csv isn't fast and it allocates. When I wrote my 
CSV parser, I'd also left around a parser which used slicing 
instead of allocation[3].


Do you know what happened with fastcsv [0]? Original thread: [1].

[0] https://github.com/quickfur/fastcsv
[1] 
http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-le...@puremagic.com


Re: std.csv Performance Review

2017-06-03 Thread Johan Engelen via Digitalmars-d

On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:


I compared these two: LDC -O3 -release


Quick note:
Keep in mind that LDC does not yet do cross-module inlining (of 
non-template functions) by default. It's good to check whether 
you see big differences with `-enable-cross-module-inlining`.


-Johan



std.csv Performance Review

2017-06-02 Thread Jesse Phillips via Digitalmars-d

Author here:

The discussion[1] and articles[2] around "Faster Command Line 
Tools" had me trying out std.csv for the task.


Now I know std.csv isn't fast and it allocates. When I wrote my 
CSV parser, I'd also left around a parser which used slicing 
instead of allocation[3].


I compared these two: LDC -O3 -release

std.csv: over 10 seconds
csv slicing: under 5 seconds

Over 50% improvement isn't bad, but this still wasn't competing 
with the other implementations. Now I didn't profile std.csv's 
implementation but I did take a look at the one with slicing.


The majority of the time was spent in std.algorithm.startsWith, 
which is called by countUntil. The calls made to empty() also add 
up through its use in countUntil and startsWith. These functions 
are by no means slow; startsWith averaged 1 millisecond execution 
time while countUntil was up to 5 milliseconds. The thing is, 
startsWith was called a whopping 384,806,160 times.
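
That call count makes sense given how the scanning is structured: 
countUntil with a needle probes startsWith at each successive 
position, so startsWith runs on the order of once per input 
character. Roughly (an illustrative reduction, not the actual 
parser code):

import std.algorithm.searching : countUntil;

// Find the next field boundary by scanning for the separator.
// countUntil(haystack, needle) tests startsWith at every position,
// hence hundreds of millions of startsWith calls on a large file.
auto nextField(const(char)[] line, string sep = ",")
{
    auto n = line.countUntil(sep);
    return n < 0 ? line : line[0 .. n];  // slice the field, don't copy it
}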


Keep in mind that the file itself has 10,512,769 rows of data 
with four columns. Now I've talked about std.csv's performance in 
the past, probably with the author of the fast command line 
tools. Essentially it came down to this: std.csv is restricted to 
parsing with only the input range API, and you can't correctly 
parse CSV without allocation. But now I'm working outside those 
restrictions, and so I offer an additional point.


Both of these do something none of the other implementations do: 
they validate that the CSV is well formed. If one finds that the 
file no longer conforms to the correct CSV layout, it makes a 
choice: either throw an exception or guess and continue on (based 
on what the user requested). While the Nim implementation does 
handle escaped quotes (and newlines, unlike fast csv), its 
parsing assumes the file is well formed; std.csv was quite prompt 
to point out that this file is indeed not well formed.
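
For reference, this throw-or-continue choice is exposed in std.csv 
through the Malformed error level:

import std.csv;

void parseBoth(string text)
{
    // default: validate, throwing CSVException when the layout breaks
    foreach (record; csvReader!string(text))
        foreach (field; record) { /* ... */ }

    // user-requested leniency: guess and continue past malformed rows
    foreach (record; csvReader!(string, Malformed.ignore)(text))
        foreach (field; record) { /* ... */ }
}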


Even though the issue can be ignored, the overhead of parsing to 
identify issues still remains. I haven't attempted to write the 
algorithm assuming properly structured data, so I don't know what 
the performance would look like, but I suspect it isn't 
negligible. There is also likely some overhead for providing the 
tokens through range interfaces.


On another note, this slicing version of the CSV parser can and 
should be included in std.csv as a specialization. But it is by 
no means ready. The feature requirements need to be spelled out 
better (hasSlicing!Range fails for strings, yet strings are the 
primary use case for the optimization), and escaped quotes remain 
in the returned data (like I said, proper parsing requires 
allocation).
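
To illustrate the constraint problem: hasSlicing!string is false 
because of auto-decoding, so a specialization would need to admit 
strings explicitly. Something like this hypothetical sketch 
(csvSliceReader is not a real std.csv symbol):

import std.range.primitives : hasSlicing;
import std.traits : isSomeString;

static assert(!hasSlicing!string);  // auto-decoding hides random access
static assert( hasSlicing!(int[])); // plain arrays are fine

auto csvSliceReader(Range)(Range input)
    if (isSomeString!Range || hasSlicing!Range)
{
    // ... return records as slices of `input` instead of copies ...
    return input; // placeholder body
}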


1. 
http://forum.dlang.org/post/chvukhbscgamxecvp...@forum.dlang.org
2. 
https://www.euantorano.co.uk/posts/faster-command-line-tools-in-nim/

3. https://github.com/JesseKPhillips/JPDLibs/tree/csvoptimize