I finally wrote a quick wrapper for gzfile that looks a lot like Nim's `File`. You can use it like:
import gzfile
import strutils

var vcf: GZFile
doAssert vcf.open("test.vcf.gz")
for line in vcf.lines:
  if not line.startsWith('#'):
    # do something with the data line
    discard
It looks like `nim-faststreams` just bridges the API gap between mmap IO and
streams interfaces, which is nice and all, but it won't help for `startProcess`
or other `popen`-like contexts where, in this discussion, stream slowness was a
problem. (@mratsim didn't say it would, exactly, but I
We have an alternative streams implementation that uses memfiles for speed:
[nim-faststreams](https://github.com/status-im/nim-faststreams)
The slowness of the Nim **streams** library is frustrating, but I've learned to
avoid `FileStream`. I simply `readAll()` into memory and then use `StringStream`,
where the unbuffered implementation is fine.
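A minimal sketch of that readAll-into-StringStream pattern (filename assumed, not from the original post):

```nim
import std/streams

# Slurp the whole file into memory once, avoiding FileStream's slow
# unbuffered reads, then iterate lines from an in-memory StringStream.
let f = open("big.vcf")
let data = f.readAll()
f.close()

var s = newStringStream(data)
var line = ""
while s.readLine(line):
  # process each line from the in-memory stream
  discard line
```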
hi @markebbert, re Zstd and htslib, see:
[https://github.com/samtools/htslib/issues/530](https://github.com/samtools/htslib/issues/530)
ping me if you have any questions on hts-nim.
Thanks @jyapayne. It's nice to have this code I can look back on for future use.
@brentp, fancy seeing you around here. I blame you for starting me on this
little journey. :-P
Thanks for pointing me towards the multi-threaded `.gz` decompression in
`hts-nim`. I will definitely use that. Did
@markebbert Good catch! Thanks for debugging :P
Now that you mention it, the whole section
if data[last] == '\l':
  buffer.add data[pos+1 ..< pos+bufSize]
else:
  buffer.add data[pos ..< pos+bufSize]
can just be replaced with
buffer.add
Just to add my $0.02 here, as I happened on this: I have avoided using gzip
stuff from Nim as it is too slow. Mark (hi!), I know you linked to
[hts-nim](https://github.com/brentp/hts-nim)
but that will give you multi-threaded decompression for bgzipped
Thanks @jyapayne. I think we (i.e., you) are really close. The first character
of each line that exceeds the buffer was getting cut off (or maybe if the prior
line exceeded buffer?). Looks like we're off by one at the same spot. I believe:
buffer.add data[pos+1 ..< pos+bufSize]
@markebbert, yes you are right! That should be
if data[last] == '\l':
  buffer.add data[pos+1 ..< pos+bufSize]
else:
  buffer.add data[pos ..< pos+bufSize]
pos += bufSize
Which will account for the buffer increase. So the code now for the `lines`
For what it's worth, and for completeness if Windows portability even matters
in this case (as @markebbert mentioned, these science things are often one time
deals), this works but is 6x slower (405 sec aka 6min 45sec) than the
`popen`/`mSlices` variant:
import strutils, osproc,
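The snippet above is cut off by the archive; a minimal, self-contained sketch of what a portable `osproc`-based variant might look like (filename and loop body are assumptions, not the original code):

```nim
import std/[osproc, streams, strutils]

# Spawn gzip via osproc and read its stdout through a Stream.
# Portable (works on Windows too), but slower than popen + mSlices.
let p = startProcess("gzip", args = ["-dc", "big.vcf.gz"],
                     options = {poUsePath})
let outp = p.outputStream
var line = ""
while outp.readLine(line):
  if not line.startsWith('#'):
    # split each data line on tabs, as in the other variants
    discard line.split('\t')
p.close()
```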
@markebbert, No problem! No need to explain yourself. I asked for feedback :)
It's all for the service of making things better!
As for your comment, though, I don't think it is duplicating lines. I extracted
the file and compared the output (adding an echo statement to the code) and
they are
@jyapayne, I don't mean to be a backseat driver, and I realize you're doing
this on the fly in your 'spare' time, but I think your updated code is now
duplicating lines (without a newline). For example, I'm seeing close to 400k
columns on the final header column that identifies sample IDs
You're welcome.
@jyapayne - Well, there is this:

import strutils, posix

proc main() =
  for line in lines(popen("gzip -dc < big.vcf.gz".cstring, "r".cstring)):
    if line.startsWith('#'):
      continue
    var i = 0
    for col in line.split('\t'):
      inc i
@cblake,
Thank you for all of the detailed comparisons, and thanks for pointing out the
counter bug. I did catch that yesterday, but it didn't make a meaningful
difference for the timing. I plan to update it in my original post.
Thanks for pointing me towards `Zstd`. That's remarkable, and
@markebbert Here is a rough version of the working code. Unfortunately, the
speed is quite a bit slower. And it does indeed look like most of the time is
spent in the split function, because without it, this code runs in ~39 seconds.
With it, it takes about 7 minutes on my machine.
After `-d:danger -d:release` and gcc-9.2 on an i7-6700k at 4.8GHz this runs in
about 46 seconds for me against the decompressed file in a RAM filesystem:
import cligen/[mfile, mslice]

proc main() =
  for line in mSlices(mopen("big.vcf")):
    if line.len > 0 and line[0]
@markebbert, yes I've actually made quite a serious error in the code I posted.
The way streams work is a bit different from files and I can't actually do what
I did to the buffer size. What actually happens is that the code reads in 100
MB of file, parses up to one line, and then throws the
I'm glad to see that my silly compiling error has proven useful in an
unintended way. :-)
@jyapayne, I was adding in my other logic but I noticed some lines were
missing. I went back to the minimal example and echoed the line right inside
the for loop. It prints the first header line
Because various Zstd ratios are all so large, it helps in many practical
circumstances more than choice of programming language (which often plays in
the 2x-5x range). Continuing with that one data file example, with just 4 cores
the output rate is 7.3 GB/s. On an otherwise idle 16 core system,
@cblake damn! I had no idea zstd was so awesome. Thanks for those numbers. I
will definitely use that for compression in the future. Also your mmap code
using cligen is really nice. I'll probably use it for parsing files in the
future ;)
@rayman22201 I'll file a PR for the problem. Just gotta
Note:
It's not `-d:release` *or* `-d:danger`; it can be both.
`-d:release` turns on GCC/Clang/MSVC `-O3` optimizations and removes
stacktraces. `-d:danger` turns off runtime checks like asserts, checking that
all array or string accesses are within bounds, that your `Natural` is >= 0 and
so
Streams in the stdlib definitely need some optimization love. I've known about
the char by char readline problem forever, and just never got around to fixing
it. See here:
@markebbert I had a crack at making your script faster. Some std libs need to
be altered, but here's the work in a standalone file:
import zip/gzipfiles
import strutils
# This is from lib/pure/streams.nim. Optimized to take a buffer
proc readStr*(s: Stream, length:
> There are so many cool languages, but learning a new language and all of its
> nuances can be so time consuming.
If you have some basic CS background, then the effort learning Nim is minimal,
it is basically reading Tutorial 1 and 2 only, and maybe some of the other free
resources listed on
@jyapayne - not according to him. :-) But your two messages clearly passed in
flight. As Araq mentioned, that `gzipfiles` module may need some work. You
should do it! Personally, I have been avoiding gzip since at least 2007. There
are just much better algos & tools now across every dimension
@siloamx, thank you for pointing out that Python list comprehensions are faster
than loops! I only recently started using them for simple list manipulations
(e.g., vector math), but I did not know they were faster. Also looks like you
can use functions within them, which makes them useful for
You're welcome.
The usual behavior for compilers is to generate slow code about as quickly as
possible, maybe also with the best debuggability, and do things like add
`-O`/`-O2` to get faster code generated more slowly. I don't think Nim should
vary from that. I am not entirely sure how much
Wouldn't it be good to make at least `-d:release` the default compile option? I
get nervous every time someone posts a perf issue like this, as I don't know
whether it would affect any of my code or not.
Thank you @cblake for testing this out.
I don't know. Thanks for the implicit compliments and all, but it's an awfully
specific set of circumstances, not a general performance analysis even of VCF
parsing, never mind DSV parsing. The performance will be closely related to how
long various column substrings are. That will be specific to the
Well, I don't really have a blog. So, this is what you get. ;-) Someone else
can, though. Ideally, just give me credit by linking back to here. Or if you
can make any of that `cligen/mslice.*split*` faster then a PR is welcome.
As a slight update, storing `big.vcf` in `/tmp` (a tmpfs aka
@cblake maybe a worthy post to the nim homepage blog:
[https://nim-lang.org/blog.html](https://nim-lang.org/blog.html) ?
FWIW, I suspect the answer to all this noise is that @markebbert was simply not
using an optimized compile (as suggested by the very first line of the very
first response to him).
@jyapayne - what I did was go to
[https://vcftools.github.io/index.html](https://vcftools.github.io/index.html)
Mark, Do you have a sample file that we can use to try to optimize the code?
I'd like to try but I'd need to test it out on a real file.
Nim is usually among the fastest languages around. Occasionally you run into a
shockingly slow proc with no warning in its documentation. Having a large set
of language/stdlib benchmarks could really help, but so far there's been little
interest in creating and maintaining such a set.
I want to encourage you on your journey with Nim. Nim can be a challenge,
especially if you are coming from a dynamic language like Python. Although Nim
has a nice and reasonably friendly syntax for most cases, it is not a
Python-like language. You cannot approach Nim like Python. It is a systems
On slightly closer inspection of that spec, it seems that backslash quoting
only happens in the `##` comment sections ignored above. So, maybe they aren't
buggy after all if that data is not important to the calculation in question.
A further point along the lines of "if we're going to provide
1. I suspect what's really slow here is Nim's IO or unzipping and this should
definitely be looked into and fixed.
2. Well, yes, strutils is for quick & dirty hacking, not for "quick
runtimes". And optimizing splits is easier said than done, effectively you need
2 different versions so that
In fact, that VCF format does use `\` escapes
([https://samtools.github.io/hts-specs/VCFv4.3.pdf](https://samtools.github.io/hts-specs/VCFv4.3.pdf)).
So, basically all the above code examples are indeed wrong, and @Araq's advice
is the right general advice (I did say they should probably
You can optimize it all you want, in the end it's still naive because it's
**wrong** : What if data spans multiple lines? What if there is some escape
mechanism via `\`? What if the data can be encoded via `%xx` (byte in hex).
There are not many file formats around that have no escape/quoting
I agree with @siloamx. I would also point out that for many less
compiler/parsing-sophisticated programmers, splitting is conceptually simple
enough to make all the difference. Such programmers may never even have heard
of "lexing". This is just to re-express @Araq's point about it being
Maybe
> 'Parsing with split is naive'
but it will be very common esp. among users who migrate from Python/Javascript.
It would be nice if you could optimize it.
I use `lexbase` and `strscans` modules and never had performance problems.
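A small illustration of the `strscans` style of parsing instead of `split` (the pattern and fields here are assumptions for a VCF-like line, not from the original post):

```nim
import strscans

# Pull the first two tab-separated fields of a VCF-like data line
# directly with scanf: $w matches a word, $i parses an integer.
var chrom: string
var pos: int
if scanf("chr1\t12345\tA\tG", "$w\t$i", chrom, pos):
  echo chrom, " ", pos   # prints: chr1 12345
```

This avoids allocating a seq of substrings for every line, which is where much of `split`'s cost goes.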
Parsing with `split` is naive, it hardly works, so we never really optimized it.
I just wrote an obj loader for my game last week; obj files are text files that
describe a 3D model. To stress-test it I tried the common large Minecraft
model rungholt, a text file with over 9 million lines. Blender took 2
minutes to load the file and Windows preview gave up. I load
Have you compiled your Nim program with option `-d:release` or `-d:danger`?
Just to ensure that it is not a default debug build.
Of course Nim should not be slower than Python, but I guess it may not be
really faster for a plain string split. The reason is that these basic
operations are generally
Hello,
I'm new to Nim, but was tempted to give it a go because I've heard it has the
simplicity of Python and the speed of C. I sat down to write my first Nim
script last week, where I mimicked a script I had already written in Python.
I was excited to see just how fast Nim would be. The