On Tuesday, 26 January 2016 at 22:36:31 UTC, H. S. Teoh wrote:
Yeah, in the course of this exercise, I found that the one
thing that has had the biggest impact on performance is the
amount of allocations involved. [...snip]
Really interesting discussion.
On Tuesday, 26 January 2016 at 22:36:31 UTC, H. S. Teoh wrote:
...
So the moral of the story is: avoid large numbers of small
allocations. If you have to do it, consider consolidating your
allocations into a series of allocations of large(ish) buffers
instead, and taking slices of the
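The consolidation advice above can be sketched in D. This is a minimal illustration of the idea, not the actual fastcsv code; the SliceAllocator name and the fixed 1 MiB buffer are invented for the example, and a real version would grow or chain buffers when one fills up.

```d
// Minimal sketch of the buffer-consolidation idea (not the actual
// fastcsv code): make one large allocation up front and hand out
// slices of it, instead of one GC allocation per field.
import std.stdio : writeln;

struct SliceAllocator
{
    char[] buf;   // one big backing allocation
    size_t used;  // bytes handed out so far

    // Copy `s` into the big buffer and return a slice of it.
    // A real implementation would grow or chain buffers when full.
    char[] store(const(char)[] s)
    {
        assert(used + s.length <= buf.length, "buffer exhausted");
        auto slice = buf[used .. used + s.length];
        slice[] = s[];
        used += s.length;
        return slice;
    }
}

void main()
{
    auto alloc = SliceAllocator(new char[](1 << 20)); // one 1 MiB allocation
    auto a = alloc.store("field1");
    auto b = alloc.store("field2");
    writeln(a, " ", b); // both are slices of the same buffer
}
```

The point is that a million fields cost one allocation plus a million cheap slice operations, instead of a million separate GC allocations.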
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
While this is no fancy range-based code, and one might say it's
more hackish and C-like than idiomatic D, the problem is that
current D compilers can't quite optimize range-based code to
this extent yet. Perhaps in the future
On Tuesday, 26 January 2016 at 06:27:49 UTC, H. S. Teoh wrote:
On Sun, Jan 24, 2016 at 06:07:41AM +0000, Jesse Phillips via
Digitalmars-d-learn wrote: [...]
My suggestion is to take the unittests used in std.csv and try
to get your code working with them. As fastcsv limitations
would prevent
On Tue, 26 Jan 2016 18:16:28 +0000, Gerald Jansen wrote:
> On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
>>
>> While this is no fancy range-based code, and one might say it's more
>> hackish and C-like than idiomatic D, the problem is that current D
>> compilers can't quite
On Tuesday, 26 January 2016 at 20:54:34 UTC, Chris Wright wrote:
On Tue, 26 Jan 2016 18:16:28 +0000, Gerald Jansen wrote:
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
While this is no fancy range-based code, and one might say
it's more hackish and C-like than idiomatic D,
On Tue, Jan 26, 2016 at 08:54:34PM +0000, Chris Wright via Digitalmars-d-learn
wrote:
> On Tue, 26 Jan 2016 18:16:28 +0000, Gerald Jansen wrote:
>
> > On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
> >>
> >> While this is no fancy range-based code, and one might say it's
> >>
On Tuesday, 26 January 2016 at 06:27:49 UTC, H. S. Teoh wrote:
My thought is to integrate the fastcsv code into std.csv, such
that the current std.csv code will serve as fallback in the
cases where fastcsv's limitations would prevent it from being
used, with fastcsv being chosen where
On Sun, Jan 24, 2016 at 06:07:41AM +0000, Jesse Phillips via
Digitalmars-d-learn wrote:
[...]
> My suggestion is to take the unittests used in std.csv and try to get
> your code working with them. As fastcsv limitations would prevent
> replacing the std.csv implementation the API may not need to
On Fri, Jan 22, 2016 at 10:04:58PM +0000, data pulverizer via
Digitalmars-d-learn wrote:
[...]
> I guess the next step is allowing Tuple rows with mixed types.
Alright. I threw together a new CSV parsing function that loads CSV data
into an array of structs. Currently, the implementation is not
On Sunday, 24 January 2016 at 01:57:11 UTC, H. S. Teoh wrote:
- Ummm... make it ready for integration with std.csv maybe? ;-)
T
My suggestion is to take the unittests used in std.csv and try to
get your code working with them. As fastcsv limitations would
prevent replacing the std.csv
On Friday, 22 January 2016 at 02:16:14 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 04:50:12PM -0800, H. S. Teoh via
Digitalmars-d-learn wrote:
[...]
> > https://github.com/quickfur/fastcsv
[...]
Fixed some boundary condition crashes and reverted doubled
quote handling in unquoted
On Friday, 22 January 2016 at 21:41:46 UTC, data pulverizer wrote:
On Friday, 22 January 2016 at 02:16:14 UTC, H. S. Teoh wrote:
[...]
Hi H. S. Teoh, I have used your fastcsv on my file:
import std.file;
import fastcsv;
import std.stdio;
import std.datetime;
void main(){
StopWatch sw;
On Fri, Jan 22, 2016 at 10:04:58PM +0000, data pulverizer via
Digitalmars-d-learn wrote:
[...]
> >$ dmd file_read_5.d fastcsv.d
> >$ ./file_read_5
> >Time (s): 0.679
> >
> >Fastest so far, very nice.
Thanks!
> I guess the next step is allowing Tuple rows with mixed types.
I thought about that
On Friday, 22 January 2016 at 01:36:40 UTC, cym13 wrote:
On Friday, 22 January 2016 at 01:27:13 UTC, H. S. Teoh wrote:
And now that you mention this, RFC-4180 does not allow doubled
quotes in an unquoted field. I'll take that out of the code
(it improves performance :-D).
Right, re-reading
On Thursday, 21 January 2016 at 10:40:39 UTC, data pulverizer
wrote:
On Thursday, 21 January 2016 at 10:20:12 UTC, Rikki Cattermole
wrote:
Okay, without registering I'm not gonna get that data.
So usual things to think about, did you turn on release mode?
What about inlining?
Lastly how about
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer
wrote:
StopWatch sw;
sw.start();
auto buffer = std.file.readText("Acquisition_2009Q2.txt");
auto records = csvReader!row_type(buffer, '|').array;
sw.stop();
Is it csvReader or readText that is slow? i.e. could you move
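Edwin's question about which phase is slow can be answered by timing the two calls separately. A sketch, assuming the same Acquisition_2009Q2.txt file from the thread; msecsOf is a helper invented for the example, and StopWatch is imported from std.datetime.stopwatch, its home in current D (the thread predates that module).

```d
// Sketch of timing the two phases separately: is it readText or
// csvReader that dominates? msecsOf is a helper invented for this
// example; StopWatch lives in std.datetime.stopwatch in current D.
import std.array : array;
import std.csv : csvReader;
import std.datetime.stopwatch : StopWatch;
import std.file : exists, readText;
import std.stdio : writefln;

// Run `work` once and return the elapsed wall-clock milliseconds.
long msecsOf(scope void delegate() work)
{
    StopWatch sw;
    sw.start();
    work();
    sw.stop();
    return sw.peek.total!"msecs";
}

void main()
{
    immutable path = "Acquisition_2009Q2.txt"; // the file from the thread
    if (!path.exists)
        return; // skip quietly when the dataset isn't present

    string buffer;
    string[][] records;
    auto tRead  = msecsOf({ buffer = readText(path); });
    auto tParse = msecsOf({
        foreach (rec; csvReader!string(buffer, '|'))
            records ~= rec.array;
    });
    writefln("readText: %s ms, csvReader: %s ms", tRead, tParse);
}
```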
On Thursday, 21 January 2016 at 11:08:18 UTC, Ali Çehreli wrote:
On 01/21/2016 02:40 AM, data pulverizer wrote:
dmd -release -inline code.d
These two as well please:
-O -boundscheck=off
the ingest of files and the
speed of calculation are very important to me.
We should understand why D is
On 01/21/2016 02:40 AM, data pulverizer wrote:
dmd -release -inline code.d
These two as well please:
-O -boundscheck=off
the ingest of files and the
speed of calculation are very important to me.
We should understand why D is slow in this case. :)
Ali
I have been reading large text files with D's csv file reader and
have found it slow compared to R's read.table function which is
not known to be particularly fast. Here I am reading Fannie Mae
mortgage acquisition data which can be found here
On Thursday, 21 January 2016 at 10:20:12 UTC, Rikki Cattermole
wrote:
Okay, without registering I'm not gonna get that data.
So usual things to think about, did you turn on release mode?
What about inlining?
Lastly how about disabling the GC?
import core.memory : GC;
GC.disable();
dmd -release
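Rikki's GC suggestion can be wrapped around the allocation-heavy parsing phase. A sketch; withGCDisabled is a helper invented for this example, and the single GC.collect afterwards trades many small pauses for one big one.

```d
// Sketch of the GC suggestion above: disable collections around the
// allocation-heavy parsing phase, then re-enable and do one collection.
// withGCDisabled is a helper invented for this example; the delegate
// body is where the CSV-reading code would go.
import core.memory : GC;

void withGCDisabled(scope void delegate() work)
{
    GC.disable();        // no collections while `work` runs
    scope (exit)
    {
        GC.enable();     // restore normal behaviour
        GC.collect();    // one big collection instead of many small ones
    }
    work();
}

void main()
{
    withGCDisabled({
        // allocation-heavy parsing would go here
    });
}
```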
On 21/01/16 10:39 PM, data pulverizer wrote:
I have been reading large text files with D's csv file reader and have
found it slow compared to R's read.table function which is not known to
be particularly fast. Here I am reading Fannie Mae mortgage acquisition
data which can be found here
On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das wrote:
On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van Leeuwen
wrote:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer
wrote:
StopWatch sw;
sw.start();
auto buffer =
On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das wrote:
On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van
Leeuwen wrote:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer
wrote:
StopWatch sw;
On Thursday, 21 January 2016 at 13:42:11 UTC, Edwin van Leeuwen
wrote:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer
wrote:
StopWatch sw;
sw.start();
auto buffer = std.file.readText("Acquisition_2009Q2.txt");
auto records = csvReader!row_type(buffer, '|').array;
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer
wrote:
On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das
Actually since you're aiming for speed, this might be better:
sw.start();
auto records =
On Thursday, 21 January 2016 at 16:25:55 UTC, bachmeier wrote:
On Thursday, 21 January 2016 at 10:48:15 UTC, data pulverizer
wrote:
Running Ubuntu 14.04 LTS
In that case, have you looked at
http://lancebachmeier.com/rdlang/
If this is a serious bottleneck you can solve it with two lines
On Thursday, 21 January 2016 at 17:10:39 UTC, data pulverizer
wrote:
On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote:
Interesting that reading a file is so slow.
Your timings from R, is that including reading the file also?
Yes, it's just insane, isn't it?
It is insane. Earlier
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer
wrote:
On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
On Thursday, 21 January 2016 at 14:32:52 UTC, Saurabh Das
wrote:
[...]
Actually since you're aiming for speed, this might be better:
sw.start();
auto records
On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote:
Interesting that reading a file is so slow.
Your timings from R, is that including reading the file also?
Yes, it's just insane, isn't it?
On Thursday, 21 January 2016 at 11:08:18 UTC, Ali Çehreli wrote:
We should understand why D is slow in this case. :)
Ali
fread source is here:
https://github.com/Rdatatable/data.table/blob/master/src/fread.c
Good luck trying to work through that (which explains why I'm
using D). I don't
On Thursday, 21 January 2016 at 10:48:15 UTC, data pulverizer
wrote:
Running Ubuntu 14.04 LTS
In that case, have you looked at
http://lancebachmeier.com/rdlang/
If this is a serious bottleneck you can solve it with two lines
evalRQ(`x <- fread("Acquisition_2009Q2.txt", sep = "|",
On Thursday, 21 January 2016 at 15:17:08 UTC, data pulverizer
wrote:
On Thursday, 21 January 2016 at 14:56:13 UTC, Saurabh Das wrote:
@Edwin van Leeuwen The csvReader is what takes the most time,
the readText takes 0.229 s
The underlying problem most likely is that csvReader has (AFAIK)
On Thursday, 21 January 2016 at 17:17:52 UTC, Saurabh Das wrote:
On Thursday, 21 January 2016 at 17:10:39 UTC, data pulverizer
wrote:
On Thursday, 21 January 2016 at 16:01:33 UTC, wobbles wrote:
Interesting that reading a file is so slow.
Your timings from R, is that including reading the
On Thu, 21 Jan 2016 18:37:08 +0000, data pulverizer wrote:
> It's interesting that the first output array is not the same as the
> input
byLine reuses a buffer (for speed) and the subsequent split operation
just returns slices into that buffer. So when byLine progresses to the
next line the
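Justin's point about byLine's reused buffer is the classic gotcha here: slices of one line are overwritten when the next line is read. A sketch of the safe variant, using byLineCopy so each line owns its storage; readFields and the '|'-separated file are assumptions for the example.

```d
// Sketch of the byLine gotcha described above: byLine recycles one
// buffer, so slices taken from a line are clobbered when the next line
// is read. byLineCopy (or an explicit .idup) gives each line its own
// storage, so the split slices stay valid. readFields is a helper
// invented for this example.
import std.algorithm : splitter;
import std.array : array;
import std.stdio : File;

string[][] readFields(string path)
{
    string[][] rows;
    auto f = File(path);
    foreach (line; f.byLineCopy)           // each line is a fresh string
        rows ~= line.splitter('|').array;  // slices into it stay valid
    return rows;
}

void main()
{
    import std.file : remove, write;
    import std.stdio : writeln;

    write("demo.psv", "a|b\nc|d\n");
    scope (exit) remove("demo.psv");
    writeln(readFields("demo.psv")); // [["a", "b"], ["c", "d"]]
}
```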
On Thursday, 21 January 2016 at 18:31:17 UTC, data pulverizer
wrote:
Good news and bad news. I was going for something similar to
what you have above and both slash the time a lot:
Time (s): 1.024
But now the output is a little garbled. For some reason the
splitter isn't splitting correctly -
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer
wrote:
I have been reading large text files with D's csv file reader
and have found it slow compared to R's read.table function
This great blog post has an optimized FastReader for CSV files:
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
[...]
It may be fast but I think it may be related to the fact that
this is not a CSV parser. Don't get me wrong, it is able to parse
a format defined by delimiters but true CSV is one hell of a
beast. Of course most data look
On Thursday, 21 January 2016 at 23:58:35 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 11:29:49PM +0000, data pulverizer via
Digitalmars-d-learn wrote:
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
>On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via
>Digitalmars-d-learn wrote:
>This piqued my
On Thursday, 21 January 2016 at 22:13:38 UTC, Brad Anderson wrote:
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
[...]
What about wrapping the slices in a range-like interface that
would unescape the quotes on demand? You could even set a flag
on it during the initial pass
On Thursday, 21 January 2016 at 22:20:28 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 10:09:24PM +0000, Jon D via
Digitalmars-d-learn wrote: [...]
FWIW - I've been implementing a few programs manipulating
delimited files, e.g. tab-delimited. Simpler than CSV files
because there is no
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer
wrote:
I have been reading large text files with D's csv file reader
and have found it slow compared to R's read.table function
which is not known to be particularly fast.
FWIW - I've been implementing a few programs manipulating
On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via Digitalmars-d-learn wrote:
> On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
> >[...]
>
> It may be fast but I think it may be related to the fact that this is
> not a CSV parser. Don't get me wrong, it is able to parse a format
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
[snip]
There are some limitations to this approach: while the current
code does try to unwrap quoted values in the CSV, it does not
correctly parse escaped double quotes ("") in the fields. This
is because to process those values
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via
Digitalmars-d-learn wrote:
This piqued my interest today, so I decided to take a shot at
writing a fast CSV parser. First, I downloaded a sample large
CSV file from: [...]
Hi H. S. Teoh, I
On Thu, Jan 21, 2016 at 11:29:49PM +0000, data pulverizer via
Digitalmars-d-learn wrote:
> On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
> >On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via
> >Digitalmars-d-learn wrote: This piqued
> >my interest today, so I decided to take a shot at writing a
On Thu, Jan 21, 2016 at 07:11:05PM +0000, Jesse Phillips via
Digitalmars-d-learn wrote:
> On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer wrote:
> >R takes about half as long to read the file. Both read the data in
> >the "equivalent" type format. Am I doing something incorrect
On Thursday, 21 January 2016 at 20:46:15 UTC, Gerald Jansen wrote:
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer
wrote:
I have been reading large text files with D's csv file reader
and have found it slow compared to R's read.table function
This great blog post has an
On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
Of course, running without GC collection is not a fair
comparison with std.csv, so I added an option to my benchmark
program to disable the GC for std.csv as well. While the
result was slightly faster, it was still much slower
On Thu, Jan 21, 2016 at 10:09:24PM +0000, Jon D via Digitalmars-d-learn wrote:
[...]
> FWIW - I've been implementing a few programs manipulating delimited
> files, e.g. tab-delimited. Simpler than CSV files because there is no
> escaping inside the data. I've been trying to do this in relatively
>
On Thursday, 21 January 2016 at 23:58:35 UTC, H. S. Teoh wrote:
are there flags that I should be compiling with or some other
thing that I am missing?
Did you supply a main() function? If not, it won't run, because
fastcsv.d is only a module. If you want to run the benchmark,
you'll have to
On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via Digitalmars-d-learn wrote:
> On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
> >[...]
>
> It may be fast but I think it may be related to the fact that this is
> not a CSV parser. Don't get me wrong, it is able to parse a format
On Friday, 22 January 2016 at 01:27:13 UTC, H. S. Teoh wrote:
And now that you mention this, RFC-4180 does not allow doubled
quotes in an unquoted field. I'll take that out of the code (it
improves performance :-D).
Right, re-reading the RFC would have been a great thing. That
said I saw
On Thu, Jan 21, 2016 at 04:31:03PM -0800, H. S. Teoh via Digitalmars-d-learn
wrote:
> On Thu, Jan 21, 2016 at 04:26:16PM -0800, H. S. Teoh via Digitalmars-d-learn
> wrote:
[...]
> > https://github.com/quickfur/fastcsv
>
> Oh, forgot to mention, the parsing times are still lightning fast
>
On Fri, Jan 22, 2016 at 01:13:07AM +0000, Jesse Phillips via
Digitalmars-d-learn wrote:
> On Thursday, 21 January 2016 at 23:03:23 UTC, cym13 wrote:
> >but in that case external quotes aren't required:
> >
> >number,name,price,comment
> >1,Twilight,150,good friend
> >
On Thu, Jan 21, 2016 at 04:26:16PM -0800, H. S. Teoh via Digitalmars-d-learn
wrote:
> On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via Digitalmars-d-learn wrote:
> > On Thursday, 21 January 2016 at 21:24:49 UTC, H. S. Teoh wrote:
> > >[...]
> >
> > It may be fast but I think it may be related
On Friday, 22 January 2016 at 00:26:16 UTC, H. S. Teoh wrote:
On Thu, Jan 21, 2016 at 11:03:23PM +0000, cym13 via
Digitalmars-d-learn wrote:
[...]
Alright, I decided to take on the challenge to write a "real"
CSV parser... since it's a bit tedious to keep posting code in
the forum, I've
On Thursday, 21 January 2016 at 23:03:23 UTC, cym13 wrote:
but in that case external quotes aren't required:
number,name,price,comment
1,Twilight,150,good friend
2,Fluttershy,142,gentle
3,Pinkie Pie,169,He said ""oh my gosh""
std.csv will reject this. If validation is turned
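For reference, the RFC 4180 form that std.csv does accept wraps the whole field in quotes and doubles the inner quotes. A sketch using std.csv from Phobos; firstRecord is a helper invented for this illustration.

```d
// Sketch of the RFC 4180 form that std.csv accepts: a field containing
// double quotes must itself be quoted, and the inner quotes are doubled.
// firstRecord is a helper invented for this illustration.
import std.csv : csvReader;

string[] firstRecord(string data)
{
    string[] fields;
    foreach (record; csvReader!string(data))
    {
        foreach (cell; record)
            fields ~= cell;
        break; // only the first record is needed here
    }
    return fields;
}

void main()
{
    // Valid per RFC 4180: the whole field is quoted, "" escapes a quote.
    auto row = firstRecord(`3,"Pinkie Pie",169,"He said ""oh my gosh"""`);
    assert(row[3] == `He said "oh my gosh"`);
}
```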
On Friday, 22 January 2016 at 00:56:02 UTC, cym13 wrote:
Great! Sorry for the separator thing, I didn't read your code
carefully. You still lack some things like comments and surely
more things that I don't know about but it's getting there. I
didn't think you'd go through the trouble of
On Fri, Jan 22, 2016 at 12:56:02AM +0000, cym13 via Digitalmars-d-learn wrote:
[...]
> Great! Sorry for the separator thing, I didn't read your code
> carefully. You still lack some things like comments and surely more
> things that I don't know about but it's getting there.
Comments? You mean in
On Friday, 22 January 2016 at 01:14:48 UTC, Jesse Phillips wrote:
On Friday, 22 January 2016 at 00:56:02 UTC, cym13 wrote:
Great! Sorry for the separator thing, I didn't read your code
carefully. You still lack some things like comments and surely
more things that I don't know about but it's
On Thu, Jan 21, 2016 at 04:50:12PM -0800, H. S. Teoh via Digitalmars-d-learn
wrote:
> [...]
> > > https://github.com/quickfur/fastcsv
[...]
Fixed some boundary condition crashes and reverted doubled quote
handling in unquoted fields (since those are illegal according to RFC
4180). Performance
On Thursday, 21 January 2016 at 18:46:03 UTC, Justin Whear wrote:
On Thu, 21 Jan 2016 18:37:08 +0000, data pulverizer wrote:
It's interesting that the first output array is not the same
as the input
byLine reuses a buffer (for speed) and the subsequent split
operation just returns slices
On Thursday, 21 January 2016 at 09:39:30 UTC, data pulverizer
wrote:
R takes about half as long to read the file. Both read the data
in the "equivalent" type format. Am I doing something incorrect
here?
csvReader hasn't been benchmarked against or optimized relative
to other CSV readers. It does have
On Thursday, 21 January 2016 at 19:08:38 UTC, data pulverizer
wrote:
On Thursday, 21 January 2016 at 18:46:03 UTC, Justin Whear
wrote:
On Thu, 21 Jan 2016 18:37:08 +0000, data pulverizer wrote:
It's interesting that the first output array is not the same
as the input
byLine reuses a buffer