Re: Dictionary syntax
In other words, it seems "excessive" to you because you do not imagine that you would ever use anything but the stdlib `Table` "behind" `{}`. Never switching out something like that is "famous last words" in some circles. :-) So, the verbosity guards against the regret of hard-coding.
Re: Dictionary syntax
The `{}` syntax is what the Lisp world calls an "association list" and is more general than what the Python world calls a dictionary literal/constructor syntax. In Nim `{a: b, c: d, ...}` is just syntactic sugar for `[(a, b), (c, d), ...]` (or `@[]` if you prefer a seq analogy, but it's at compile time when either seq or array lengths are "known"). This is good in that you can use the same syntax for multiple back-end associative array implementations (like a simple sorted array, trees, alternate hash tables, etc.). In a sense the syntax is only loosely coupled to the semantics here. The cost is the slight extra verbosity of tacking on a `.toTable`, `.toSortedArray`, `.toBST`, `.toBTree` or whatever, which is deemed an acceptable trade-off for the generality (see the sketch below). Nim is not Python and you do have to put types/type-related things in various places.
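For example, the same literal can feed more than one back-end (a small sketch using just the stdlib `tables` module; the hypothetical `.toSortedArray`/`.toBST` above would slot in the same way):

```nim
import tables

let kv = {"a": 1, "b": 2}                 # really just [("a", 1), ("b", 2)]
echo kv[1]                                # a plain (key, value) tuple: ("b", 2)
let t = kv.toTable                        # hash-table back-end...
let o = {"a": 1, "b": 2}.toOrderedTable   # ...or an insertion-ordered one
echo t["b"], " ", o["a"]
```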
Re: NvP: s.add('x') 100M times
For this particular benchmark `--gc:boehm` uses the least memory and time for me on nim 28510a9da9bf2a6b02590ba27b64e951a208b23d with gcc-10.1 and PGO but that least is still 2.5x the RSS of python-2.7.18. Not sure why, but yeah it is 35x faster than Python.
Re: NvP: s.add('x') 100M times
I don't disagree. Might need delving into the generated C to figure out, but I'm guessing my results are not hard to reproduce. If they are let me know how I can best help.
Re: NvP: s.add('x') 100M times
Just did a non-PGO regular `-d:danger` run. Times went up 1.9x but memory usage patterns were the same with `gc:arc` using much more RSS than `gc:boehm` or `gc:markAndSweep`. It's a pretty tiny program.
Re: NvP: s.add('x') 100M times
@cumulonimbus \- I tried that. Didn't alter the behavior I was seeing. If this behavior was not always there then my guess is that some arc bug was causing a crash, got fixed, and now the fix causes this. Regardless of whether it was always there or appeared by bug-jello-squishing accident as I theorize, we should probably have a little suite of "memory use regression" tests to prevent stuff like the scenario I described. Such a suite would be a kind of "correctness testing" for deterministic memory management. Could have a "fuzzy/ball park compare". Maybe we have such already, perhaps informally? If so, we should add this `str1` to it. If not, it can be the first test. :-)
Re: NvP: s.add('x') 100M times
Yup. Just what I was seeing, @b3liever. No `main()`-difference to the RSS delta, and a very noticeable delta in a non-intuitive direction. So, either our intuitions are wrong in a way which should be clarified or there's a problem which should be fixed. Maybe a github issue?
Re: NvP: s.add('x') 100M times
I don't see how your linked algo explains deltas across gcs if that `3 div 2` growth happens for all of them. The memory `gc:arc` uses here seems more like the sum of all prior allocations, not "up to 1.5x what's needed". Actually, `gc:arc` uses 1.25x more mem than `gc:none` (312MB) in a test I just tried.
Re: NvP: s.add('x') 100M times
I also usually find Py2 much faster than Py3. Pypy usually helps. Cython more so, but with much more work. Anyway, the obvious better way to do it in Nim (which I always assumed was "never the point") is

```nim
var s = newStringOfCap(100_000_001)  # or whatever
for i in 0..100_000_000:
  s.add('x')
echo len(s)
```

which runs 2x as fast as otherwise and uses exactly the right amount of memory. I mention it just in case @HashBackupJim was unaware.
Re: NvP: s.add('x') 100M times
@HashBackupJim \- `newSeqOfCap[T](someLen)` also exists (sketched below) and, yes, pre-sizing can help a lot in Nim (and almost any lang that supports it). Profile-guided optimization at the gcc level can also help Nim run timings a lot - in this case 1.6x to 1.9x for various `gc` modes. [https://forum.nim-lang.org/t/6295](https://forum.nim-lang.org/t/6295) explains how. LTO also helps, since most of the boost of PGO is probably from well-chosen inlining.

@sschwartzer \- not only string benchmarks...Interpreter start-up/etc. Anyway, this isn't a Python forum, and benchmarking "always depends". :-) :-)

Someone else should reproduce my "`--gc:arc` uses more memory than `gc:none`" result for the original str1.nim, or one with a `main()` (or both). I think this kicking of the tires has probably uncovered a real problem.
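For completeness, the `seq` analogue of the pre-sized string from earlier in the thread (a tiny sketch):

```nim
var s = newSeqOfCap[char](100_000_001)  # reserve capacity up front; no growth reallocations
for i in 0..100_000_000:
  s.add 'x'
echo s.len
```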
Re: NvP: s.add('x') 100M times
The same problem happens on nim-1.2.0 (well, nim-devel git hash ed44e524b055b985b0941227a61f84dd8fdfcb37). So, this is a long-lived, maybe since the beginning behavior of `gc:arc` (but we should still have a memory overuse regression test). Probably time to look at generated C.
Re: NvP: s.add('x') 100M times
One other Nim-level thing I can say is that things work as expected for `seq[int]` of the same memory scale (100MB). I.e.,

```nim
proc main() =
  var s: seq[int]
  for i in 0..12_500_000:
    s.add 1
  echo len(s)

main()
```

produces a memory report (using /usr/bin/time on Linux) like:

```
187MB  seq2-arc
250MB  seq2-default
250MB  seq2-msweep
265MB  seq2-boehm
300MB  seq2-none
```

So, this problem is **_only_** for Nim `string`. Indeed, if one changes `string` to `seq[char]` in the original example, usage goes down to 139MB, roughly what one would expect for a 3/2 growth policy.
Re: NvP: s.add('x') 100M times
I was mistaken. I was compiling my `seq` test with `-d:useMalloc` which fixes the problem. Sorry..fiddling with too many knobs. `string` and `seq[char]` behave identically with `gc:arc` and both get fixed (139MB) with `--gc:arc -d:useMalloc`. Other GCs (including `none`) still beat `gc:arc`-without-`useMalloc` on `string`. However, other gc's spend more memory (like 420MB) than `gc:arc` on `seq[char]`. So, whatever is going on, `seq` is actually worse than `string`, not better. { But also a side note for @HashBackupJim to try `-d:useMalloc` with `--gc:arc`. } At this point, I should raise an issue over at Github. I'll link it back here.
Re: NvP: s.add('x') 100M times
Github issue: [https://github.com/nim-lang/Nim/issues/14811](https://github.com/nim-lang/Nim/issues/14811)
Re: Why does wrapping the code in a top level procedure make it faster?
@Stefan_Salewski \- The idea of defaulting to an optimizing mode has come up before. It seems contrary to what almost any user coming from a "compiled language" world would expect. For various reasons (debuggability, compile-time performance, etc.), almost all compilers default to a non-optimized (or very weakly optimized) output and allow various knobs to crank up optimization, as does the current `nim` setup. There is even a whole dark art of "best optimization flags" which can be taken [to severe extremes](https://github.com/Acovea/libacovea). More simply/efficiently/some might say intelligently, you can often use PGO [https://forum.nim-lang.org/t/6295](https://forum.nim-lang.org/t/6295) to get 1.25..2.0x boosts on object code generated from nim-generated C. Some flags like `-ffast-math` can even change semantics in subtle ways that can impact correctness or not depending on use cases.

I don't know what to do about people publicizing misleading benchmarks. That seems an ineliminable hazard, not only for Nim, but actually **_everywhere and all the time_**, and often not on purpose (though most "marketing" wants to do strawman comparisons on purpose). Besides compiling wrong, they could also use a bad algorithm, misrepresentative inputs, weird machines, benchmarks naive relative to intended measurements, and probably several other mistake categories. :-/ The best strategy may be trying to be supportive where & when we can, educating as we go, though I agree/admit that is a neverending battle.
Re: Why does wrapping the code in a top level procedure make it faster?
Ah. Sorry that I misunderstood the sarcasm. Maybe my content/xref was not worthless anyway. ;-)
Re: Why does wrapping the code in a top level procedure make it faster?
As the person who added `rightSize`, I actually agree with those points. The only slight gotcha is that, with the change you suggest, old code that already passes the correct power of two to avoid a resize (or maybe already passes a `rightSize` result) would allocate the _next higher_ power, wasting space. That might be a price worth paying, ease-of-use-wise. We _could_ call `rightSize` only if the argument is not a power of two, but it's probably simpler to just always use `rightSize` and just tell people; wasting space isn't quite a "failure". Anyway, you should feel free to file a github issue and/or work on a pull request. Maybe I should have pushed harder at the time to change the parameter semantics.

There is definitely a learning curve coming from other languages, and there are definitely more examples of the stdlib not providing the "best in class" for various functionalities. There will likely be even more with time; Nim core only has so much manpower to write/maintain/support such things. We discussed doing a "distribution" over at [https://github.com/nim-lang/RFCs/issues/173](https://github.com/nim-lang/RFCs/issues/173) and elsewhere. This "fusion" repository (see the end of that thread) seems to have been the output, but not much has been happening over there.

Anyway, I think if you persevere there are good chances you could be a happy Nimmer once you've learned your way around. You seem pretty capable of finding these issues and you could probably help round off some of the sharp edges, if you can sustain the patience. There should probably be a Porting to Nim From Python guidebook - maybe you could help do one!
Re: Why does wrapping the code in a top level procedure make it faster?
In my experience batteries are almost never fully charged, and it's hard to get feedback if you only release a perfect cathedral. With no feedback, it's kind of unlikely you will build something popular. To take a topical example, even after 30 years of tuning, the core language `dict` in Python still seems to have no easy way to "pre-size" an instance. There is no power of 2, no `rightSize`, nuthin'. So, one of your rough edges literally cannot arise due to an inflexibility/API design flaw (IMO). Yes, there must be 3rd party replacements or ways to pre-size in some slow subclass or whatever, but you could also just write a `proc initTab` that always calls `rightSize` for your own code (see the sketch below). What's at issue here is "out of the box" (and personally I think the Nim workaround is less onerous than workarounds in the Python world).

Do you have to learn your way around? Sure. A sales pitch of "just like Python but compiled/with native types" is definitely not quite right. That's Cython/similar. Analogies and oversimplifications help some people while others want the real story, and you strike me as the latter. Nim gives programmers more power and flexibility, but responsibilities also come with that. Cue Spider-Man's Uncle Ben. ;-)

It is yet a comparatively tiny community. Bugs, workarounds, and rough edges come with that. Unused patterns like your `&` are just unused, unoptimized things waiting to be discovered/fixed. There was a time when `C` had no proper namespacing of struct members, and that is why some core Unix types have prefixes like the `st_*` in `struct stat` (run "man 2 stat" on any unix machine). No one using Nim _wants_ it to be hard, but there may also be good reasons some things are the way they are (often flexibility, performance, or safety, but yes, sometimes neglect).

I'm really sorry to hear your entry has been tough, but the flip side is that you could probably make a HUGE impact for future similar-but-not-quite-you's, and I again encourage you to try! Even just documenting everything somewhere would probably help at least a few other Python refugees. :-) Cheers and best wishes whatever you decide.
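Just to make that `initTab` idea concrete, a minimal sketch (the wrapper name and default are mine, not stdlib):

```nim
import tables

proc initTab[K, V](expected = 64): Table[K, V] =
  ## always pre-size via rightSize so callers can pass an expected element count
  initTable[K, V](rightSize(expected))

var ages = initTab[string, int](1000)   # room for ~1000 entries without early resizes
ages["alice"] = 42
```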
Re: Introducing --gc:arc
You should also try PGO both with gcc and clang. [https://forum.nim-lang.org/t/6295](https://forum.nim-lang.org/t/6295) { yes, I know you posted that. :-) xref for others. :-) }
Re: Python transpiler
@ShalokShalom can perhaps clarify what he meant, but I took him to be asking for a `c2nim`-like program `python2nim` to help facilitate porting pure Python code to pure Nim code (as opposed to using already-written Nim code from within Python or already-written Python code from within Nim). Some runtime libraries to bridge API differences might also help along those lines. Some quick web searches turn up a few partial moves in this direction.
Re: Python transpiler
I think part of the "sales pitch" for Nim out in the world, for good or ill, is "sorta like Python but compiled with static types". It isn't so strange that people show up asking how to automate "sorta like" into "more so". ;-) The vast majority of Python code that I have seen uses precious little of Python's very dynamic nature. So, doing a half-baked py2nim with a **_slew_** of caveats might have some value. It's not like `c2nim` is perfect, either. The questions are more A) if the caveats are beyond the grasp of the typical person porting such code and B) if the rather open-ended nature (ever-changing Py3) would leave it half-done forever. @juancarlospaco may be right that a group effort here could yield the best results.

It isn't my question to answer, but as to "why?", it is likely "easier good performance". Python unassisted by C/Cython/Pythran/etc. or Pypy is one of the slowest programming languages around, and there are piles and piles of Python code out there doing some useful thing that could be safer/faster/etc. with Nim. Nim just invoking/calling that code is one way to access the functionality, but if it was too slow to begin with, that won't help. People may have misplaced expectations about how little they have to think in order to improve performance, of course.
Re: Macro to create a dictionary (table) like in python!
I would have to agree with `@`. +1 vote.
Re: Macro to create a dictionary (table) like in python!
These association lists, often abbreviated "alist", get a lot of play in lisp, but also OCaml, Haskell, etc.. [https://en.wikipedia.org/wiki/Association_list](https://en.wikipedia.org/wiki/Association_list)
Re: Hyphens Not Allowed in Nim Filenames? [Invalid Module Name]
I think it would be good to allow the quoted form for consistency's sake. Also, there should probably be a `{.out: "whatever-name".}` pragma, like the `{.passC.}` etc. pragmas, to control the output name from within the source file. (Maybe there is now, but last I looked there wasn't. Maybe `outFile` or `outExe` are better names.)
Re: Hyphens Not Allowed in Nim Filenames? [Invalid Module Name]
(consistency with what is possible for identifiers in other contexts, in case my intent wasn't clear..)
Re: Hyphens Not Allowed in Nim Filenames? [Invalid Module Name]
Saying you would recommend against use without saying why, and then saying that recommendation alone is a good enough reason, is not much of an argument, actually. So, why would you recommend against its usage?

Kebab-case is quite popular for program names in the Unix world. Fully 20% of my `/usr/bin` programs have a `'-'` in them (about 3X the number that use underscore). (These are not programs named by me.) Also popular is having the name of the program match the name of the source file up to an extension. So, if you mean to say this 75% super-majority of dash-vs-underscore program namers is so wrong-headed, what exactly have they got wrong?

As for "dealing with code" that uses it, use in library/module names as opposed to program names is unlikely to be common. That said, tossing backticks on there doesn't sound like much of a burden..It might be less onerous in some more system-wide sense than having a "library version" and a "program version" or symlinks or other kinds of aliasing some people might concoct to work around the problem. I think having a `{.out:.}` pragma to let a program set its own (default, overridable on the `nim c` command-line) name is a good step, too. That might solve most people's problems. It would still seem lamely inflexible that module names are more restricted than other identifier names (procs, vars, etc.).
Re: What do you think about the programming language NIM?
@mratsim's code may not trigger it, but at least in `/devel` there seems to be a check in `semexprs.semOverloadedCallAnalyseEffects` that errors out with `errRecursiveDependencyIteratorX` ("recursion is not supported in iterators: '$1'"). The error message was even updated just a couple months ago. It is not hard to trigger this erroring out with trees that are "run-time switched" recursions vs. @mratsim's compile-time switched recursion. I for one would love full recursion support in iterators (and so would love to be wrong, if I am).
Re: What text editor are you using for Nim?
Obviously, editing time can be very different from start-up time, but it's easy to benchmark start-up time and it's somewhat indicative. In the early 1990s `emacs` with non-trivial `.emacs` configurations could take several to many seconds just to start up and load a file while `vim` would start up in the blink of an eye (like `< 100ms`). These days, `vim` has become bloated (in default configs) while `emacs` has gotten faster (especially with `emacsclient`, and maybe even more so if this JIT compiler ever takes off). So,

```
$ utime vim -c quit nada
0.012548586 0 0 0.0%
$ utime emacs -q -nw --eval='(save-buffers-kill-terminal)'
0.017100606 0.01 0 58.5%
```

or "both are pretty similar" (`12.5 ms` vs `17 ms`). With `vim -u /dev/null -c quit`, which is a little more fair since I had `-q` on the `emacs`, you see more like `1.8 ms`. (`vim --startuptime` doesn't show many hot spots, but a lot of miscellaneous loads.) However, with `emacsclient`, its time goes down to a similar `2.9 ms`, only you get a fully loaded environment. And the "busybox vi" start-up is like 190 microseconds, but boy is that a spartan environment - more analogous to vim of the early 90s.

All these times, though, are "too fast to notice", as is the case with most everyday editing operations, unless there's some kind of loop. All probably have "scaling holes" they could patch up with some significant re-engineering. I mostly use `vim` and I've seen it grind to a halt with some really long lines in binary files. Developer time generally matters a lot more, though that often relies a lot on what one is "used to". Anyway, I thought some numbers and command-lines that readers could try on their own configs might add some objectivity. (Well, that `utime` thing is a homebrew command of mine...Sorry. Needed the resolution. [https://github.com/gsauthof/cgmemtime](https://github.com/gsauthof/cgmemtime) will do `ms` resolution anyway.)
Re: What text editor are you using for Nim?
While I doubt I disagree at all with @amalek's editor recommendations (brief aside - kakoune actually distributes a `nim.kak` though I don't know how well it works), it's also worth pushing back on the idea that lisp needs to be slow. Interpreters do tend to be, and interpreters of dynamic languages even more so, and I don't think vimscript executes very quickly. Compiled lisp, like Nim (or even Cython), can with a little "`(declare)` care" be roughly on par with C, depending on what one is doing. Of course, different languages have more or less "performance fragility" for lack of a standard term. In terms of people having emacs-extensibility with less steep performance tradeoffs, there is this nascent effort [https://tromey.com/blog/?p=982](https://tromey.com/blog/?p=982) It would be nice to see that go forward someday. Taking some 500 ms pause time down to 150 ms might make all the end-user difference necessary to really improve that experience.
Re: What text editor are you using for Nim?
I should perhaps also have mentioned that if emacs' garbage collection behavior was the real pause-time culprit for @amalek then that may not be any better with a JIT, since the problem may be about memory loads more expensive than interpreter CPU cycles. [https://cse.buffalo.edu/~mhertz/bc-pldi-2005.pdf](https://cse.buffalo.edu/~mhertz/bc-pldi-2005.pdf) has some answers, but only for the "last level caching" of hard disk/SSD data in RAM. It's possible deeper cache (like L3/TLB) thrashing is behind some `emacs` GC pauses. It's likely that much of the perceived "slowness of Lisp" is about such GC-(hw|OS) impedance mismatches. That's not fundamental, though. Nim's GC usually works fine. In my limited experience almost no elisp packages (or users!) try to tune GC behavior, though maybe recently that's been different ([http://bling.github.io/blog/2016/01/18/why-are-you-changing-gc-cons-threshold/](http://bling.github.io/blog/2016/01/18/why-are-you-changing-gc-cons-threshold/)). My personal guess is that optimal settings are probably _dynamic_, related to competition for L3, and should not be some fixed 800kB number. Maybe @Kaushalmodi knows? Anyway, editor performance tuning is a bit off the main thread topic.
Re: How to add a symbol to represent the infinity (∞)
I assume you realize that your second example will not get to "Year Zero" since it sure looks like you are requesting an infinite number of B.C. years? Not sure you want to build an entire lazy evaluation system and symbology like Perl6..It's easy to do an infinite iterator by just writing your own iterator called `fromXtoInf` [https://nim-lang.org/docs/tut1.html#iterators](https://nim-lang.org/docs/tut1.html#iterators) (and you could put that unicode ∞ symbol in the iterator name if you want, but you might need to backquote the symbol). I'm really not sure, though, how many would think that is clearer than simply:

```nim
var number = 0
while true:
  number.inc
  # ...
```
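For reference, the `fromXtoInf` iterator mentioned above would only be a few lines - a sketch, with a purely illustrative name:

```nim
iterator fromXtoInf(start: int): int =
  var i = start
  while true:       # never terminates on its own...
    yield i
    inc i

for year in fromXtoInf(2010):
  if year > 2015: break   # ...so the caller needs some stopping condition
  echo year
```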
Re: How to add a symbol to represent the infinity (∞)
@LeuGim \- nicely done & interesting approach! I'm not sure I would have imbued `inf(0)|Infinite(0)` with as precise semantics in your `for 3 .. inf(0): ...` construction. Infinity is more a process than a number. For `∞ - ∞ ~ inf(0)` to act like zero, the two limiting processes have to precisely cancel out. In some abstract sense, that's much less { infinitely less? :-) } likely than being lopsided, with the result being more like either `-∞` or `+∞`. Without more context there's not enough to know which (or zero). So, being undefined is probably a better choice. Also, for triple bonus extra credit along these lines you could try to do the point at infinity in the complex plane/Riemann sphere. ;-) { [https://en.wikipedia.org/wiki/Riemann_sphere](https://en.wikipedia.org/wiki/Riemann_sphere) }
Re: Compile C file together with Nim one
Yet another approach is to have the `nim c`-generated C code just `#include "foo.c"` via `{.importc: "func", header: "foo.c".}`.
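A tiny sketch of what that looks like, assuming `foo.c` sits next to your Nim file and defines `int my_func(int x)` (all names here are made up):

```nim
proc myFunc(x: cint): cint {.importc: "my_func", header: "foo.c".}

echo myFunc(41)   # foo.c gets #include'd by the generated C; no separate compile step
```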
Re: What do you think about the programming language NIM?
I do not agree that lazy linear lists are the "main example" of recursion. They may be the _simplest_ example, and thus allow several easy workarounds. I mentioned "trees" right in my comment. Most of the (admittedly subjective) elegance of tree algorithms comes from the way recursive code structure mirrors recursive data structure. `lib/pure/collections/critbits.nim:iterator leaves` shows some of the ugliness required to work around this, for example. I agree it may not be a high priority or even in the top 10. I do not think it would be considered a "no real need" extra/extraneous fluff/"feature bloat". I know you didn't say that, exactly, but I felt your post risked leaving that impression.
Re: What do you think about the programming language NIM?
There are always workarounds since CPUs are little state machines. That should be news to no one. Recursion working makes code look much nicer, IMO. For what it's worth, the example considered among the more compelling by the Python guys back in the Python 2.3 days when they added generators was in-order binary search tree traversal:

```nim
iterator inorder[T](t: Tree[T]): T =
  if t:
    for x in inorder(t.left): yield x
    yield t.label
    for x in inorder(t.right): yield x
```

which parallels the usual recursive call structure quite well. I think it would be great if the above Nim actually worked. Equivalent state machines are often ugly, trickier to get right, and a pain point (and yeah, usually faster). But "expressive" has always been a main goal of Nim. :-)
Re: What do you think about the programming language NIM?
Oh, I had meant to include `{.closure.}`. Oops. Fixed. A stack (of something) is fundamentally required for general non-linear/tree cases. TCO can only help your linear cases (partly why I call them "easy"). Good trees have well bounded depth, but yeah, wildly unbalanced ones can cause trouble for **any** impl, recursive or not. A non-recursive one may minimize the per-stack entry storage a bit better, though.
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
I agree with @siloamx. I would also point out that for many less compiler/parsing-sophisticated programmers, splitting is conceptually simple enough to make all the difference. Such programmers may never even have heard of "lexing". This is just to re-express @Araq's point about it being "naive" in maybe a slightly more accommodating way. Probably, they should learn, but maybe they want to focus on some simple problem. Formats aren't always in a strict/pure table format amenable to `parsecsv`.

I have some routines in `cligen/mslice` ([https://github.com/c-blake/cligen](https://github.com/c-blake/cligen)) that may help for this. Some example usage is in `examples/cols.nim`. In the `mmap` mode of `cols`, I get ~50% the run-time of GNU `awk` for field splitting, though I have admittedly never tried with 15,000 columns.

One final, related point is that `gunzip` can be very slow at decompressing and is single-threaded. If the average column width is ~20 Bytes, 300k*15k*20 =~ 90 GB. Working with the uncompressed file directly or using something that can decompress in parallel like `Zstd` ([https://en.wikipedia.org/wiki/Zstandard](https://en.wikipedia.org/wiki/Zstandard)) may help **a lot**, especially if you have 4-12 idle cores as many do these days. On Linux, you should be able to just

```nim
let f = popen("pzstd -cdqp8 < myfile.zs".cstring, "r".cstring)
```

(You may have to, just once, `zcat file.gz | pzstd -f -p8 -19 > file.zs` and, of course, this will not help if you have a pile of `.gz` files only to be processed once.)
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
In fact, that VCF format does use `\` escapes ([https://samtools.github.io/hts-specs/VCFv4.3.pdf](https://samtools.github.io/hts-specs/VCFv4.3.pdf)). So, basically all the above code examples are indeed wrong, and @Araq's advice is the right general advice (I did say they should probably learn). That said, I also still think there are many situations both quick & dirty and otherwise where simple splitting makes sense and is not wrong. Why else provide it in `strutils`?
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
On slightly closer inspection of that spec, it seems that backslash quoting only happens in the `##` comment sections ignored above. So, maybe they aren't buggy after all if that data is not important to the calculation in question.

A further point along the lines of "if we're going to provide `split`, maybe we should provide faster variants" is that the times when it is going to matter most will be for huge inputs where the columns may well have a more regular nature like numbers and specifically forbid more complex lexical structure. There it might A) be correct, B) performance might matter a lot in human terms, and C) the programmer might mostly be non-sophisticated with regard to even terminology like "lexing" or "vectorized memchr". This all seems to be the case with this VCF thing, but I feel like it's come up quite a few times over the years. It may not **always** be "bugs running faster". :-) Anyway, I don't think "to force people to learn new terminology/techniques" is a very welcoming answer. So, I tried to provide something more welcoming. Even if their parsing is sloppy & error prone, I think naive programmers facing consequent errors on their own data sets rather than complaining about Nim library performance is better optics for Nim.

All that said, I think we agree 100% that we probably need more information from @markebbert to help him any more with his actual problem. Maybe it is IO. Maybe he didn't even compile with `-d:danger`. If he's on Linux, I would suggest him decompressing first and trying my `mmap` versions. 90 GB/(100 MB/s) =~ 900 seconds =~ 15 minutes. Heck, some people even have 90GB of RAM. :-)
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
FWIW, I suspect the answer to all this noise is that @markebbert was simply not using an optimized compile (as suggested by the very first line of the very first response to him).

@jyapayne \- what I did was go to [https://vcftools.github.io/index.html](https://vcftools.github.io/index.html), download the distro, and find the biggest VCF file in there (`contrast.vcf`). Then I did something like this:

```
#!/bin/zsh -f
head -n46 contrast.vcf > head.vcf
tail -n+47 contrast.vcf > tail.vcf
hardTab=$'\t'
## `cols` is from `cligen/examples`
cols -c -d "$hardTab" {1..7} < tail.vcf > tail-more.vcf
## Above has 106 distinct rows. Probably diverse enough.
paste tail.vcf $(repeat 2500 echo tail-more.vcf) > tail-wide.vcf
cat head.vcf $(repeat 30 echo tail-wide.vcf) > big.vcf
```

That last file is only about 1.5 GB, likely about 60x smaller than @markebbert's, but should have otherwise similar statistics (15,000 columns) and fit in almost anyone's RAM. (It also compresses via Zstd to under 500 _kB_ due to the way it was synthesized and decompresses in under 1/4 of a second for me...)

Then I just ran his initial Python and Nim (dropping the gzip stuff and adjusting path names). I reproduced the unsurprising Nim debug-build slowness (11.8 seconds) with his Python running at 4.81 seconds. Then with `-d:danger` got the Nim running in 2.07 seconds, about 2.3X faster than the Python. Then, just for kicks, I did a version using libraries that I alluded to which re-uses the same two `seq`s for column outputs and got the Nim to 0.984 seconds, almost 5X faster than his Python:

```nim
import cligen/[mfile, mslice]

proc main() =
  # var genotype: string
  var i = 0
  var cols, subCols: seq[MSlice]
  for line in mSlices(mopen("big.vcf")):
    if line.len > 0 and line[0] == '#':
      continue
    discard line.msplit(cols, '\t', 0)
    for col in cols:
      if i >= 9:
        discard col.msplit(subCols, ':', 0)
        for fmt in subCols:
          # genotype = $fmt  # Py did not do anything here..
          break
      inc(i)

main()
```

What further optimizations make sense, such as eliding many splits by bounding columns, using iterators rather than `seq`, etc., ultimately depend upon what further calculation he was intending to do with the parsed data. Scaling up to his problem size, this last version would translate to about 1 minute run time vs his initial 400 minutes.

"In real life", likely 90+% of his time would be spent waiting on `gunzip`. Literally any compressor allowing parallel decompression (`pixz`, `pzstd`, etc.) or just fast single-threaded decompression like `lz4` would probably be much less of a pain point for him. I realize, though, that he may have piles of giant data files already "trapped behind gunzip" in ways beyond his control. Converting them may still help him if he has many repeated calculations to do and the disk space.
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
Well, I don't really have a blog. So, this is what you get. ;-) Someone else can, though. Ideally, just give me credit by linking back to here. Or if you can make any of that `cligen/mslice.*split*` faster then a PR is welcome.

As a slight update, storing `big.vcf` in `/tmp` (a tmpfs aka `/dev/shm` RAM filesystem) and doing profile-guided optimization with gcc (via a little `nim-pgo vsn2 ./vsn2` script I have around), I get about a 15% improvement, down to 0.85 s. Linux `perf` tells me about 58% of that time is just in `__memchr_avx2`, which is already hand-tuned assembly. That is probably about as fast as you can get (at least on my Skylake-generation CPU) in any language. You might be able to eke out another 10..20% or more if you hand-rolled everything in assembly and did just the right prefetches/branch-predictor gaming. Or you might not. Parsing at 1.85 GB/s isn't so bad. For reference, on that machine single-threaded RAM bandwidth is about 32 GB/s, and as mentioned Zstd can spit out that compressed file about 3x faster { though that is multi-threaded over 4+ cores..So, it's actually slightly slower on a single-core basis }.

Anyway, I doubt there is any real Nim problem here. A better use of time might have been to just wait for @markebbert to respond to the very first line of the very first response.
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
I don't know. Thanks for the implicit compliments and all, but it's an awfully specific set of circumstances, not a general performance analysis even of VCF, never mind DSV, parsing. The performance will be closely related to how long the various column substrings are. That will be specific to the data I just made up from a sample file to have something reproducible by others. Someone with real data files like @markebbert would be in a better position to say something interesting about VCF parsing. { I also continue to agree with Araq that for this kind of situation it would be better if someone who's read that PDF which I linked to more thoroughly than I did (or want to) wrote some careful VCF parser API in Nim. }

So, I'm not sure what the post would be "about" exactly, except perhaps some implicit performance one-upsmanship of [https://nim-lang.org/blog/2017/05/25/faster-command-line-tools-in-nim.html](https://nim-lang.org/blog/2017/05/25/faster-command-line-tools-in-nim.html). Such one-upsmanship is often a nearly never-ending game of diminishing returns and benchmark debates that I don't have time/inclination to pursue further at this moment. Also, it's just code outside the stdlib anyway. Feels a little like someone else's yard to me rather than some specifically Nim-conventional way to do something. `cols.nim` is already a small 45-line example program, and the above snippet is even smaller. But by all means, if you or Kaushal want to write more about it and link back here then you should! - cheers
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
You're welcome. The usual behavior for compilers is to generate slow code about as quickly as possible, maybe also with the best debuggability, and do things like add `-O`/`-O2` to get faster code generated more slowly. I don't think Nim should vary from that. I am not entirely sure how much speed-up comes from `-d:release`/`-d:danger` versus just using the optimizing modes of gcc. If people are coming from Python they are going to have to learn about compilation options sooner rather than later. Someone, like in this case, who seemingly thought to switch to an iterator but not to try higher compiler optimizations..Well, I don't know how to prevent that. Their mind is _almost_ in the right place. We can just wait until they respond to the did-you-do-d:danger as a first question before worrying, though. I'm only about 90% sure that's even what happened in this case since we haven't heard back from @markebbert.

This `split` question has also come up a lot, e.g.: [https://forum.nim-lang.org/t/1580](https://forum.nim-lang.org/t/1580) [https://forum.nim-lang.org/t/2282](https://forum.nim-lang.org/t/2282) [https://forum.nim-lang.org/t/3557](https://forum.nim-lang.org/t/3557) It may not always/often be the best approach, but it is an operation provided by virtually every programming environment. Even ANSI C has `strtok`. So, for good or ill, it's often a "first comparison". So, I think we probably should have re-usable `var seq[]`-accepting overloads in the stdlib as well as the iterator, in case whatever sample code needs random access to post-split members.
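Roughly the kind of overload meant - a hypothetical sketch, not the current `strutils` API: the caller owns the `seq`, so its storage gets re-used across calls.

```nim
proc splitInto(s: string, sep: char, fields: var seq[string]) =
  ## hypothetical re-usable-buffer variant of strutils.split
  fields.setLen 0
  var start = 0
  for i in 0 .. s.len:
    if i == s.len or s[i] == sep:
      fields.add s[start ..< i]
      start = i + 1

var cols: seq[string]
splitInto("a:b:c", ':', cols)
echo cols        # @["a", "b", "c"]
```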
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
@jyapayne \- not according to him. :-) But your two messages clearly passed in flight. As Araq mentioned, that `gzipfiles` module may need some work. You should do it! Personally, I have been avoiding gzip since at least 2007. There are just much better algos & tools now across every dimension except perhaps "nostalgia". :-) I think gzip should be retired when & where it can be, but I understand that it remains popular. Also, I suspect doing a `popen` implementation for Windows in Nim's stdlib and just making that utterance I mentioned above more portable might provide higher long-term value. The library approach only really adds value over external commands when people have many small compressed files, which is probably not that common a case. It's very easy to separate the concerns of "which compressor" with external commands.

@markebbert \- Consider migrating to a parallel compressor and especially decompressor if your colleagues/data stakeholders would approve. You may also want to be careful about list comprehensions in terms of memory footprint. E.g., for one L1-cachable row they may be fine, but for the whole O(100GB) file they are a bad idea. Oh, and Nim really can be as fast as C/C++/Fortran. You should give it some more time when you have a chance. In some ways it's even simpler than Python (e.g. UFCS).
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
Nim is much closer to one-language-to-rule-them-all than most, and I think it's very amenable to "gradual learning". The more you put in, the more you get out. Also, my parallel-decompression-to-optimize-IO recommendations are language independent { they depend more on things like whether you have RAM to just buffer the whole file and/or disk array/NVMe IO >> Zstd decompress output rates (often ~2 GB/s..several GB/s), etc. }. I realize that it can be a chore to re-encode terabytes of data, educate others why, and train them how to access it. I expect it can take your 10 minutes to get through your data down to < 1 minute, though (not counting re-compression time, of course). It may be another "surprising order of magnitude" for you, but it's pretty off topic. So, I'll shut up about it. Anyway, good luck, and open issues/ask questions in the forum if you have them!
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
Just some numbers on my machine for that exact file to make earlier points more concrete:

```
78 seconds to decompress with gzip (@jyapayne must have a fast box!)
1598.59s (u) + 16.65s (s) = 415.53s to recompress with pzstd -f -p8 -19

File Size    Name
26966704829  ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf
 1923680177  ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
  165270307  ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.zs
(So, the Zstd file is almost 12x smaller.)

3.68 seconds to decompress with pzstd -cdqp8
```

So, 78 seconds vs 3.7 s =~ 21X. Could be even bigger with more free cores. More than an order of magnitude time savings with also more than an order of magnitude in space savings, once the file is re-encoded.
Re: Can the return value of a proc be a variable marked {.global.} ?
While I agree with @mratsim basically across the board, to be fair to the proposal I think it was implicit that, if the defaults switched, there would be an easy way to activate a debugging mode, like `nim c -d:debug`. I think Nim should be like all other compilers, though, which is how it currently is. I think the best course of action is just a new `Hint`, which power users can turn off, that tells newbies @mratsim's "Reminder". There could even be two, one corresponding to `-d:release` and one to `-d:danger`.
Re: Can the return value of a proc be a variable marked {.global.} ?
Well, maybe open an issue or do a pull request. It seems like a 5-15 minute task for someone (plus probably waiting hours on the CI).
Re: Can the return value of a proc be a variable marked {.global.} ?
@Kaushalmodi's issue, while interesting, is fairly independent since it almost certainly is just about `gcc` and not, say, `clang` (and quite probably just some range of versions of `gcc`). And people can fiddle with their `-O` levels in `gcc.options.speed` in `nim.cfg/etc`. I don't think it really argues/advises against @mratsim's warning/hint solution.
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
Because various Zstd ratios are all so large, it helps in many practical circumstances more than choice of programming language (which often plays in the 2x-5x range). Continuing with that one data file example, with just 4 cores the output rate is 7.3 GB/s. On an otherwise idle 16 core system, I get decompress times of 2.26 sec => output rate of 11.93 GB/s (1 TB in 83 seconds, 1 PB in a day). These numbers are really much faster than almost all IO systems these days. (Not "all", but probably "all except crazy fat RAIDs of 4 or more NVMe's"). To keep that data pipeline fed, you only need 165 MB/2.26s =~ 73 MB/s which is also well within reach of most IO systems. So, getting that 7-12 GB/s vs gzip's paltry 345 MB/s decompress rate is quite achievable and 20X..34X speed boost vs gzip is compelling. Assuming that compression ratio of 163:1 holds up for other VCF files, that petabyte of VCF output would only take up like 6 TB disk which a lot of people have lying around these days. And people like markebbert are apparently patient enough to wait for 7 hours. So, in some sense, petabyte-scale home computing has been around for half a decade now "on the down low for the somewhat patient". I'm a little surprised it isn't more well known/widely discussed. Not everything compresses like gangbusters the way this VCF data does, though.
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
After `-d:danger -d:release` and gcc-9.2 on an i7-6700k at 4.8GHz this runs in about 46 seconds for me against the decompressed file in a RAM filesystem:

```nim
import cligen/[mfile, mslice]

proc main() =
  for line in mSlices(mopen("big.vcf")):
    if line.len > 0 and line[0] == '#':
      continue
    var i = 0
    for col in line.mSlices('\t'):
      if i >= 9:
        for fmt in col.mSlices(':'):
          # do something with $fmt
          break
      i.inc

main()
```

Note that the `i = 0` should be moved down like this for the loop to be similar. The first `[9..]` Nim slice version was not quite converted correctly to the iterator version, and that mistake propagated to @jyapayne's version. Mark may well have caught that already. Also note that there was some kind of misunderstanding earlier about `maxSplit` helping a lot, but the code seems to want to parse all columns _except_ the 9 early header columns.

Anyway, time beyond about 40 seconds .. 1 minute of "parsing time" should just be IO. And that IO could be reduced to probably 4 seconds with Zstd, but may be more like 1.5 minutes with gzip. So, I might expect times somewhere in the 2-3 minute range for @jyapayne's sample file. The statistics of the two files sound pretty different: @jyapayne's example file is 3.5e6 \n chars and 8.7e9 \t chars while @markebbert reported 300e3 \n chars and 4.5e9 \t (300k*15k). I might expect that @markebbert's might run faster, but his data columns may be larger.
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
You're welcome.

@jyapayne \- Well, there is this

```nim
import strutils, posix

proc main() =
  for line in lines(popen("gzip -dc < big.vcf.gz".cstring, "r".cstring)):
    if line.startsWith('#'):
      continue
    var i = 0
    for col in line.split('\t'):
      if i >= 9:
        for fmt in col.split(':'):
          # do something with $fmt
          break
      i.inc

main()
```

which takes about 255 seconds (4m 15s) on @jyapayne's sample file on my machine, and then if you are willing to have a cligen dependency you can get it about 4x faster (67 seconds) with this:

```nim
import strutils, posix, cligen/mslice

proc main() =
  for line in lines(popen("gzip -dc < big.vcf.gz".cstring, "r".cstring)):
    if line.startsWith('#'):
      continue
    var i = 0
    let msLine = toMSlice(line)
    for col in msLine.mSlices('\t'):
      if i >= 9:
        for fmt in col.mSlices(':'):
          # do something with $fmt
          break
      i.inc

main()
```

The latter needs a `popen` analogue on Windows (I bet there's one in the stdlib, but I always just use `popen`), and he'll have to `nimble install cligen`, but he may want to anyway. Another perhaps non-obvious note about the `popen` approach is that the decompressor runs as a separate process, and so in parallel if you have at least two idle cores. Hence, run-time tends to be max(decompress time, parse time) rather than the sum of the two times. (Bandwidth across a unix pipe is usually >15GB/s anyway.) Anyway, Mark could probably run that 2nd version on his file. It might be a little faster or slower, but is probably within 2x, which is still 5x faster than Groovy. Might be more than 10x faster.

To answer some of @markebbert's compression questions..It's a lot to cover, but I'll try to be brief. The `.zs` or `.gz` or `.bz2` or `.xz` formats are all different, as are the related compression algorithms, but some tools support old formats for compatibility. You won't necessarily get any performance boost in any dimension (time/space/etc), though. To get good parallel speed-up of decompressed output, the compressor must prepare the input specially (in N independent file regions - so that N threads can be decompressing their streams independently). `pixz` can create such independent-region `.xz` format files, decompressible with parallel speed-up by itself or without parallel speed-up by regular `xz -d`. There is a similar `gzip` tool called `pigz` that can do similar for `.gz` files. I don't know if `pzstd` can do like `pigz` does for `gzip` with independent regions; `pigz` itself always used to focus only on compression (the slow, but do-it-only-once part) instead of decompression (the do-it-many-times, wouldn't-it-be-great-to-be-fast-in-parallel part). Even if `pzstd` can, unlike `pigz`, get parallel un-gzip going, you won't get the giant 163:1 compression ratios, though - in fact they will be somewhat worse ratios than regular `gzip -9` ratios. Moving to Zstd-native compression algos & formats is probably the best advice, but also the most work in terms of education, persuasion, etc., as I've mentioned before.

Of course, even my `memchr`-based `mSlices` only runs about 2x faster than `gzip` decompresses. So max(4 seconds, 45 seconds) will be ~45s, which isn't that much better than ~67s. So, "costs in context" and all that may apply. (I think 67s is better than my earlier 78s due to cache effects - decompress to the pipe & throw away, all in L2, vs decompress to a RAM filesystem - if anyone is keeping tabs on my numbers that closely, which admittedly seems unlikely.)
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
For what it's worth, and for completeness if Windows portability even matters in this case (as @markebbert mentioned, these science things are often one time deals), this works but is 6x slower (405 sec aka 6min 45sec) than the `popen`/`mSlices` variant:

```nim
import strutils, osproc, streams, cligen/mslice

proc main() =
  let p = startProcess("gzip -dc < big.vcf.gz", options = {poEvalCommand})
  let outp = p.outputStream
  var line = newStringOfCap(4096).TaintedString
  while outp.readLine(line):
    if line.startsWith('#'):
      continue
    var i = 0
    let msLine = toMSlice(line)
    for col in msLine.mSlices('\t'):
      if i >= 9:
        for fmt in col.mSlices(':'):
          # do something with $fmt
          break
      i.inc

main()
```

That `streams` code needs some better line-buffering love, though { Or `osproc` could use `File` instead of `Stream` }. `system/io.nim:readLine(File,..)` used to be a similarly slow almost identical implementation. But, the clear speed winners so far are either the `mopen` variant decompressed if you have the space/RAM or, if you run on Unix, the `lines(popen())`-`mSlices` variant (re-encoded with `pzstd` if you need to process the same file many times).

{ If `nimble install` doesn't work for you, in a pinch, you could always `git clone https://github.com/c-blake/cligen`, copy `cligen/mslice.nim` into the same dir as your program and adjust the `import` to its unqualified name. I get that Araq doesn't want to rely upon libc `memchr` being fast or support different compile-time/run-time versions, but 4X slower is a pretty big hit. That's why I tossed `mslice` into `cligen` so others might benefit. I'm not even sure `mSlices` is as fast as possible and as I mentioned various overheads clearly depend on string/substring lengths. }
Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?
It looks like `nim-faststreams` is just to bridge the API gap between mmap IO and streams interfaces, which is nice and all, but won't help for `startProcess` or other `popen`-like contexts where, in this discussion, streams slowness was a problem. { @mratsim didn't say it would, exactly, but I thought I should clarify (unless I'm wrong). }

More importantly, perhaps to @markebbert's surprise, it turns out that the bioinformatics guys are being smarter than `pigz` with this `bgzip` tool which supports parallel decompression. That very same example file that's been my running example since @jyapayne mentioned it is in fact `bgzip`d. It can decompress in only 11.1 seconds on the same machine with 4 threads. (I got it down to 3.5 seconds with over 25 threads, but that same Zstd example gets down to 2.2s using similarly many threads.) So, an `mSlices` + `popen("bgzip -d --threads 4 < ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz")` version should be able to go in roughly 1 minute for that file. As @brentp surely knows, his `hts-nim` library actually supports a `set_threads` and `read_line` that can deliver that, too. It's entirely possible @markebbert could have been getting that much faster decompress time all along (with python and groovy as well, via a popen analogue). Or maybe his `.gz` is not from `bgzip` and he couldn't. The way for him to tell is to run `file` on the `.gz` and see if it has "Blocked GNU Zip Format (BGZF; gzip compatible), block length" instead of simply "gzip compressed data".

Of course, the compression ratio will still be 12x worse than Zstd. Besides the obvious space waste, rather than a tiny 45-75 MB/s you would need backing IO of 172..542 MB/s (depending on 4 or 25 threads). In the high-end many-thread case, you're looking at either a fast SSD or a RAID array of spinning disks to feed that analysis pipeline (assuming zero-cost parsing, anyway, which I know his example calculation did not have, but that assumption simplifies since he didn't really show a full calculation).
Re: Getting Unix-Style file permissions
It may also depend on how "standard" the desired string is whether @jasper's solution is enough for @cnuzzi. Usually, the setuid, setgid, and sticky bits have some `'s'` or `'t'` substituted for the `'x'`, and a capitalized `[ST]` if one of those bits is set but the corresponding execute permission is missing. I think Unix standardization efforts were disintegrating by the time they were getting around to this level of the userspace utilities. So, @jasper may need to use something like this (from [https://github.com/c-blake/lc/blob/master/lc.nim](https://github.com/c-blake/lc/blob/master/lc.nim); search for `proc fmtPerm`):

```nim
import posix

proc fmtPerm*(m: Mode, s = ""): string =
  ## Call with ``.st_mode`` of some ``Stat`` to get rwxr-x..; ``s`` is an optional
  ## separator string such as a space or a comma to enhance readability.
  let m = m.uint and 4095
  const rwx = ["---", "--x", "-w-", "-wx", "r--", "r-x", "rw-", "rwx"]
  result = rwx[(m shr 6) and 7] & s & rwx[(m shr 3) and 7] & s & rwx[m and 7]
  let o = s.len
  if (m and 0o4000) != 0 and (m and 0o100) != 0: result[2]   = 's' # setuid, +x
  if (m and 0o4000) != 0 and (m and 0o100) == 0: result[2]   = 'S' # setuid, no x
  if (m and 0o2000) != 0 and (m and 0o010) != 0: result[5+o] = 's' # setgid, +x
  if (m and 0o2000) != 0 and (m and 0o010) == 0: result[5+o] = 'S' # setgid, no x
  if (m and 0o1000) != 0 and (m and 0o001) != 0: result[8+o] = 't' # sticky, +x
  if (m and 0o1000) != 0 and (m and 0o001) == 0: result[8+o] = 'T' # sticky, no x
```

I didn't see any direct code to do this formatting in `posix`, but you surely would `import posix` to get `Mode`. That code just uses the raw low 12 bits, but those are actually pretty standard.
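A quick usage sketch (the path here is just an example; `stat` is the `posix` module's wrapper and `fmtPerm` is the proc above):

```nim
when isMainModule:
  var st: Stat
  if stat("/tmp", st) == 0:        # any path you like
    echo fmtPerm(st.st_mode)       # e.g. "rwxrwxrwt" for a typical sticky /tmp
```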
Re: Getter and Setter methods in Nim
Should probably also be mentioned/of interest to @torarinvik that Nim has fancier support for setters as well. See [https://nim-lang.org/docs/tut2.html#object-oriented-programming-properties](https://nim-lang.org/docs/tut2.html#object-oriented-programming-properties)
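For reference, a condensed version of the tutorial's pattern - a proc whose name ends in `=` can be invoked with assignment syntax (a small sketch; the type and field names are made up):

```nim
type
  Socket* = object
    h: int                 # private field, not exported

proc `host=`*(s: var Socket, value: int) {.inline.} =
  ## setter of the host "property"
  s.h = value

proc host*(s: Socket): int {.inline.} =
  ## getter of the host "property"
  s.h

var s: Socket
s.host = 34                # same as `host=`(s, 34)
echo s.host
```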
Re: Winning the Base64 benchmarks.
Back when I used Cython more, I did always use Cython and just gradually typed things in as required (or, for some things, did the equivalent of #include'ing some C for SSE-intrinsics-type code). I agree it is not generally "popular", but it is a very legitimate mode of usage. Adding more and more cdef's etc. brings your code closer and closer to C semantically while staying in Cython syntax. Partly, I had better profiling tools for such code than for "pure python" style. They even have a nice `--annotate` mode supporting this usage style that spits out an HTML page with clickable generated code, colorized by how C-like it is. That kind of source-to-"assembly" visualization tool would be nice for Nim as well, or even just gcc/clang going all the way to "real" assembly. It's got a bit more "oomph" for things spanning orders of magnitude of efficiency like pure Py and C, though, and the CPython API calls Cython generates may be a little easier to read than real assembly, much like the C code Nim generates. Now, why good modes of usage (of almost anything) are not more popular..well, that's some question for the ages. People are imitative. "It's not popular because it's not popular." How to bootstrap something catching on is just..tricky.
Re: Difference between two dates
If you need a larger range and only care about dates (not times of day), there is a pretty simple formula valid for all dates after 4800 BC in [http://www.faqs.org/faqs/calendars/faq/part2/](http://www.faqs.org/faqs/calendars/faq/part2/), question 2.16.1.
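For the curious, that formula converts a Gregorian calendar date to a Julian Day Number; subtracting two JDNs then gives the difference in whole days. A small sketch:

```nim
proc jdn(year, month, day: int): int =
  ## Julian Day Number of a Gregorian date (valid for dates after 4800 BC)
  let a = (14 - month) div 12
  let y = year + 4800 - a
  let m = month + 12*a - 3
  day + (153*m + 2) div 5 + 365*y + y div 4 - y div 100 + y div 400 - 32045

echo jdn(2020, 6, 25) - jdn(2020, 1, 1)   # days between the two dates
```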
Re: Some questions about cligen
At present your best bet for your first style is to just have the wrapped `proc` receive a `seq[string]` and just test it yourself. Note though, that you aren't really getting much value out of any option parsing framework vs. just using `os.cmdLineParams()`. While it could be more elegant/automatic, the second thing you ask about works fine with the current `cligen` as shown by `test/MultMultMult.nim` in the `cligen` repo.
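A minimal sketch of that first style, assuming the usual `cligen` pattern in which a defaulted parameter becomes an option and a `seq[string]` parameter catches the positional arguments (the names here are made up):

```nim
import cligen

proc mycmd(verbose = false, args: seq[string]) =
  ## hypothetical command; it validates/filters its positional args itself
  if args.len == 0:
    quit "need at least one argument"
  for a in args:
    echo (if verbose: "arg: " & a else: a)

when isMainModule:
  dispatch(mycmd)
```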
Re: Get first element in Table
Minor correction to @mratsim \- if you want to iterate and break after the first then you can use _any_ of the table types, not only `OrderedTable`. E.g.

```nim
import tables

var a = {1: "one", 2: "two"}.toTable
for key, val in a:
  echo key
  break
```

Of course, Nim `Table` somewhat unusually allows keys to be duplicated (as in a C++ STL `multimap`, but with no ordering guarantees). So, you cannot be sure (without more work) which of the possibly several entries that first slot corresponds to. You may know that the keys are unique by construction/etc., though, and want to pop them off in whatever "fast order" is convenient. Indeed, the `HashSet` in `sets` even provides a `pop` proc. E.g.,

```nim
import sets

var a = ["one", "two"].toHashSet
let x = a.pop
echo x
echo a
```
Re: Get first element in Table
Of course. I suppose I should have been explicit that things are "system/data-structure ordered", not "key ordered". Any iteration has "an" order, though. That weird order just may or may not be what is desired, or it might be fine.

In terms of your `sorta`/B-Tree, I would say that (like most fledgling ordered tree implementations) it does not separate "seeking" from "updating" (inserting/deleting). Search trees are often even more useful when augmented, and augmented search trees allow multiple kinds of (useful) order. E.g., B-trees need merely one more number per already-large node, and a little extra maintenance of that number, to support seek-by-rank as well as seek-by-key (thus morphing into an order statistics tree). So, a better design is one organized around a "path handle", e.g., `path: seq[tuple[ptr, index]]`. Seek-by-X procs can populate said `path`, updating procs can act upon it, and iterators can start from whatever kind of seek, with next/prev updating some path handle. Done right, the augmentation can be totally optional, decided at instantiation time. Layering more traditional `Table`-like interfaces on top of that path-oriented internal design is then much easier than trying to implement everything directly.
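A rough sketch of the shape being described, with made-up names (this is not the `sorta` code):

```nim
type
  Node[K, V] = ref object
    keys: seq[K]
    vals: seq[V]
    kids: seq[Node[K, V]]     # empty in leaf nodes
    count: int                # optional augmentation: total items in this subtree

  Path[K, V] = seq[tuple[node: Node[K, V], index: int]]   # root..leaf "cursor"

# seekKey/seekRank procs would fill in a Path; insert/delete and next/prev
# iteration would all act on that same handle, keeping rank support optional.
```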
Re: Get first element in Table
(Sometimes this path handle I am advocating is referred to as a "cursor" in analogy with text-editing. I don't usually see it motivated by coding economy/simplicity for optionally augmented trees, as I am motivating it here. Anyway, these comments/analyses are probably better made as issues on your git repo.)
Re: Nim for Statistics
> we just Nim someone to write the package

Assuming you meant "need", this is the best slip of the tongue/keyboard and/or entry auto-correct ever. "Nim".."need"..all the same. ;-)
Re: Nim for Statistics
Mostly academic points to correct the record, but I feel like mratsim's linked "Fenwick Tree" uses up to 2x the cache memory of the trees of Fenwick 1993, 1994, 1995. So, maybe not quite even state of the art for the mid-1990s (depending on metrics/data sizes)? Also, maybe not really even Fenwick Trees? The Yu2015 paper referenced in the code _very_ questionably re-brands what Wong 1980 had called simply "binary trees" as seemingly new "F+Trees", while removing the space optimization that was Fenwick94's contribution over Wong80. All quite weird. Don't get me wrong - that sampler is good code. Maybe not optimal, though.
Re: Nim for Statistics
For an alphabet size `A` (any int), an actual Fenwick Tree needs only `n = 2^ceil(log_2(A))` array slots. If `A` is a power of two they need no extra memory over a simple histogram. Your `fenwicktree.newSampler` uses `2*n-1` (for that same `n`). And it's not some +-0/1-origin indexing threshold thing, but a legit 2x through the whole last power of 2. So, we don't disagree on what you use, but on what you _could_ use. That (admittedly confusing) paper even says "In fact, Fenwick Tree can be regarded as a compressed version of the F+ tree" (which itself kind of exhibits their terminological weirdness). It is to this "compressed" that I refer. If you don't want to think about it for a year that's ok. It is pretty far off topic. There are even further off topic things like [https://arxiv.org/pdf/1612.09083.pdf](https://arxiv.org/pdf/1612.09083.pdf) that warrant a mention for the curious.
Re: Nim for Statistics
Oh, and while I'm sure mratsim already knows all this, because this is a thread about Nim and statistics and R and other things and most passers-by/search result finders probably do not know it, it bears noting that these binary indexed sum trees are often the fastest way to compute things like medians or other quantiles or other order statistics over moving data windows, at least up to a certain precision aka number of bins (a small sketch follows at the end of this post). Such CDF operations are almost the same as sampling without replacement. In both cases one is maintaining a CDF and/or its inverse as data comes & goes (with moving data windows the data both comes and goes, while with sampling it only goes). With Wong/Fenwick/binary/Fibonacci/whatever decomposition indexed sum trees, the per-datum scaling becomes about the _alphabet size_ or precision, **not** the window size. (There is always an overall O(N data points) factor for the whole data set, not just each window.)

By comparison, the R `runmed` implemented in `Srunmed.c`, for example, has atrocious `O(w)` scaling in window size `w`. `Srunmed.c`, while better than the _most_ naive sort-every-data-window-not-even-with-a-radix-sort `O(w*log(w))`, is very poor work. It does not even do the binary-search-of-an-ordered-array-with-AVX-optimized-memmove adjustments, which would be both simpler than what `Srunmed.c` already does and a significant constant factor speed up, as well as generalizing to arbitrary quantiles. These index decomposition sum trees' (probably the most suggestive name) `w`-independence is even better, unless you need too many bins. 64..256 KiBins is a lot of resolution and fits in L3 these days even for 8 byte counter sizes, though 4B or even 2B are often enough. This is obviously easier with the 2x space optimization I mentioned above. True full precision for index decomposition sum trees needs a bin for every possible number, or say 2^32 bins for a `float32`, which can get very memory intensive or require expensive-in-the-small sparse arrays/hashing. At some loss of precision with numbers (as opposed to mratsim's words case), you can have some, e.g., exponentially spaced bin mapping from the full range down to a more manageable number of bins. If you need full precision/exact answers then usually the best performer is a B-tree augmented to be an order statistics tree, which can get you "full precision" with `O(lg(w))` per datum scaling. Even a ranked binary search tree as per KnuthV3 in the late 1960s is better than R's `Srunmed.c`, though.

`w` can easily be in the 1000s and N millions or more. So, R's current core impl is potentially leaving many orders of magnitude of performance on the table. Of course, people using R are rarely very sensitive to performance. /end-R-impls-a-lot-but-has-longstanding-poor-impls-of-even-basic-stuff-rant
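Since people may want to see the moving-window idea concretely, here is a minimal from-scratch sketch (this is **not** mratsim's sampler; `Fenwick`, `findByRank`, etc. are names I made up, and it assumes data already mapped to small non-negative integer bins):

type
  Fenwick = object
    tree: seq[int]                      # 1-origin partial counts

proc initFenwick(nBins: int): Fenwick =
  Fenwick(tree: newSeq[int](nBins + 1))

proc add(f: var Fenwick, bin, delta: int) =
  var i = bin + 1                       # 0-origin bin -> 1-origin tree index
  while i < f.tree.len:
    f.tree[i] += delta
    i += i and (-i)

proc cumCount(f: Fenwick, bin: int): int =
  var i = bin + 1                       # CDF: count of items in bins 0..bin
  while i > 0:
    result += f.tree[i]
    i -= i and (-i)

proc findByRank(f: Fenwick, rank: int): int =
  ## Inverse CDF: smallest 0-origin bin whose cumulative count reaches `rank`.
  var rank = rank
  var pos = 0
  var step = 1
  while step shl 1 < f.tree.len: step = step shl 1
  while step > 0:
    if pos + step < f.tree.len and f.tree[pos + step] < rank:
      pos += step
      rank -= f.tree[pos]
    step = step shr 1
  result = pos

when isMainModule:                      # moving median over a window of w samples
  let data = @[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
  let w = 5
  var f = initFenwick(256)
  for t, x in data:
    f.add(x, 1)                         # a datum enters the window..
    if t >= w: f.add(data[t - w], -1)   # ..and the oldest one leaves
    if t >= w - 1:
      echo "median of window ending at ", t, ": ", f.findByRank((w + 1) div 2)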
Re: Marshal and friends
Totally agreed, but it bears noting that Python also has a module named `marshal` with a `dump` and `load` that has generally been much faster/more space efficient than `pickle`/ `cPickle`. Basically "pickle but even less readable". So, one could also think of Nim `marshal` as like Python `marshal`. :-)
Re: Marshal and friends
Well, fair point and a nice link. I mostly thought identical names for roughly identical things were worth a mention. Since this thread is about storing to disk, I should mention the too often neglected option of near-zero overhead native-binary-layout file formats that are "`mmap` & go" with no repeated parsing or even unnecessary paging. The file system can basically become your memory allocator. ASCII->binary and especially binary->ASCII is an expensive conversion for numbers (though any cost is always "compared to what"). Native binary does break "portability" to, e.g., machines with different endianness/byte-order or non-IEEE floating point formats and whatnot. In the modern era, little endian and IEEE FP have won to such a degree that in practice this may matter little, if at all. In the old days, sharing a network volume between, e.g., a big endian Sparc and a little endian Intel box was more common. Anyway, Nim supports this all fine (as in, e.g., my [https://github.com/c-blake/suggest](https://github.com/c-blake/suggest)), but A) that is _far_ from pedagogical code { but hey, at least it's a fully worked example }, and B) it can often be error prone work in general. Someone out in Nim land should maybe do a macro package someday to make it easier (unless they already have?). A tiny sketch of the basic idea is below.
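As a small, hedged illustration of "`mmap` & go" with the stdlib `memfiles` module (assuming a hypothetical `data.bin` holding nothing but packed, native-endian `float32`s):

import memfiles

var mf = memfiles.open("data.bin", mode = fmRead)    # map the file read-only
let n = mf.size div sizeof(float32)                  # how many floats it holds
let xs = cast[ptr UncheckedArray[float32]](mf.mem)   # view it as an array; no parsing
var total = 0.0
for i in 0 ..< n:
  total += xs[i].float                               # pages fault in only as touched
echo "sum of ", n, " floats: ", total
mf.close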
``Table.take`` should be ``Table.pop`` -- discuss
Before we leave things as just `take` in `Table`-land for very long/a full release cycle and have to worry about backward compatibility, I thought this warranted a much wider discussion than a comment thread in a closed pull request:

> [https://github.com/nim-lang/Nim/pull/12600](https://github.com/nim-lang/Nim/pull/12600)

Also, a relevant side question: should `pop` be added to `apis.rst`? My opinions are in the link, but others should chime in. This might also be of broader interest to relate to "rapid stdlib retractions in the Nim 1.0 era" if we did want to just rename rather than deprecate the brand new `take` overloads.
Re: ``Table.take`` should be ``Table.pop`` -- discuss
@mratsim - it's not just you. Formatting looks pretty garbled. @araq - sensible to me to duplicate/alias rather than deprecate. Pretty low-cost.
Re: the "type" of the curly-bracket structure
What Nim calls the table constructor, as dom96 has correctly pointed out, the Lisp world has long called an "association list", often abbreviated "alist". Since it is a macro you are writing, you could check that the kind of the NimNode is `nnkTableConstr` and, if not, emit a useful compile-time message for client code, as in something approximately like:

import macros

macro `@@@`(x: untyped): untyped =
  if x.kind != nnkTableConstr:
    error "triple-@ expects a {(,),(,)..} table constructor"
  # ... build & return whatever the macro really produces from `x` here ...

var a = @@@{"fromStruct": "blah", "sub": {"joe": 10}}
#var b = @@@[1,2,3]   # Errors out with the above message.
Run
Re: ``Table.take`` should be ``Table.pop`` -- discuss
Ok. Just to close the loop on this thread, I believe this accommodates all expressed views: [https://github.com/nim-lang/Nim/pull/12678](https://github.com/nim-lang/Nim/pull/12678) (except maybe the "one week pre-release implies backward compatibility forever" view or the "if one table has a proc, all tables should have it" view, but people didn't seem to comment on/care much about that aspect of the question).
Re: parseopt with negative numbers as positional arguments
It might not be quite as CL-user transparent as what you're looking for, but having the CL user quote a space in front of the number usually works, as in:

tempconvert \ -10 F --to=C
Run

or

tempconvert " -10" F --to=C
Run

You might need to strip the whitespace off pre-parsing depending on what integer parser you are using.
Re: ``Table.take`` should be ``Table.pop`` -- discuss
An un-keyed `pop(Table): tuple[K,V]` more similar to `HashSet.pop` is mentioned in the PR commentary linked above. (I would personally just use the already-defined iterator rather than b3liever's inline expansion, but eh. Either way.) Totally suitable for another PR. I just wanted to keep the above one more focused.
Re: Best way to store/search/etc and an int-int data structure
For the curious, that sparse-dense array construction of mratsim's impl is useful when keys are sparse-subsets of a small universe and hails from Preston Briggs' Rice PhD thesis in 1992, but has this more self-contained follow-up paper: [http://dcs.gla.ac.uk/~pat/ads2/papers/p59-briggs%5b1%5d.pdf](http://dcs.gla.ac.uk/~pat/ads2/papers/p59-briggs%5b1%5d.pdf) If you need this to be an "associative array" with "satellite data" as values for given keys, that is a straightforward extension where the dense array can be an array of pairs (or you could have a parallel dense value array). At some slightly different point in the design space is Varghese's Aggregate Bit Vectors which do not allow associative extension, [https://cseweb.ucsd.edu/~varghese/PAPERS/icnp2002.pdf](https://cseweb.ucsd.edu/~varghese/PAPERS/icnp2002.pdf) , which is also pretty easy to implement. And, of course, `Table[int,int]` is not actually that slow either.
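For the curious, the textbook Briggs/Torczon construction is only a few lines of Nim. A sketch (not mratsim's actual code; names are mine, and it trades `universe` words of space for O(1) membership, insert, and clear):

type
  SparseSet = object
    sparse: seq[int32]        # indexed by key; a (possibly stale) slot in `dense`
    dense: seq[int32]         # the members, packed; also gives iteration order

proc initSparseSet(universe: int): SparseSet =
  SparseSet(sparse: newSeq[int32](universe))

proc contains(s: SparseSet, k: int): bool =
  let i = s.sparse[k]
  result = i < s.dense.len.int32 and s.dense[i] == k.int32

proc incl(s: var SparseSet, k: int) =
  if k notin s:
    s.sparse[k] = s.dense.len.int32
    s.dense.add k.int32

proc clear(s: var SparseSet) = s.dense.setLen 0     # O(1); `sparse` is never reset

when isMainModule:
  var s = initSparseSet(1_000_000)
  s.incl 42; s.incl 7; s.incl 42
  echo (7 in s), " ", (8 in s), " members: ", s.dense.len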
Re: Practical examples showing how macros lead to better code?
In addition to Andrea's very good examples, there is also [https://github.com/c-blake/cligen](https://github.com/c-blake/cligen) which can generate a command parser-dispatcher from `proc` signatures or from-command-line initters from types. Everyone knows about CLI parsers. I think it's a real time saver. The macro code itself is pretty hairy, though. Much macro code in Nim is pretty involved to handle many cases. If you are looking to just get a fully worked out introduction, I think that `memo` code is the best place to start.
Re: Proposal to start a Nim-Scientific Community
In Nim `float` should be the same as `float64`. `float32` is smaller which means less memory bandwidth needed as well as wider vector instructions in modern CPUs (e.g., on AVX you can fit 8 float32 in one register but only 4 float64). Some neural net people want to use a couple variants of 16-bit (or even 8-bit) floating point formats, while for certain very high precision cases 128-bit is becoming less rare. So, I don't think there is some perfect choice of size/accuracy and so your `SomeFloat` idea sounds smarter to me. You might also benefit from having "iterative" numerical routines accept some kind of "precision" parameter that defaults to something (probably dependent upon the concrete floating point type in play), but allows users to maybe trade off speed & accuracy themselves. I.e., if you have some series approximation with an error bound then allow users to pass the max tolerable error (either in absolute terms or relative terms). Nim's named parameter passing with defaults should make it easy to have a common convention in your library.
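E.g., a hedged sketch of what such a tolerance parameter might look like (`expSeries` and its default are invented for illustration; the default adapts to whichever `SomeFloat` is instantiated):

proc expSeries*[F: SomeFloat](x: F; tol: F = F(1e-6)): F =
  ## Naive Taylor series for e^x that stops once terms drop below the caller's `tol`.
  var term = F(1)
  var k = 1
  result = term
  while abs(term) > tol:
    term = term * x / F(k)
    result += term
    inc k

echo expSeries(1.0)                        # float64 with the default tolerance
echo expSeries(1.0'f32, tol = 1e-3'f32)    # float32; caller trades accuracy for speed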
Re: Walking trees without recursive iterators
In the interests of a more self-contained thread and to perhaps more easily see what it's doing, here is a more brief version of @sschwarzer's algorithm:

import xmltree

type Handle* = tuple[kidIx: int; par, kid: XmlNode]   #Ix of kid in parent

iterator handlesHiToLo*(node: XmlNode): Handle =
  ## Recursively yield ``Handle`` objects for direct&indirect kids of ``node``
  ## from high to low index order (on same level). { This order simplifies
  ## removing kids in-place without affecting later yielded kid indices. }
  ## The root node is included in yielded ``Handle``s with a ``nil`` parent.
  template recordYield(par: XmlNode, kidIx: int, kid: XmlNode = nil) =
    handle = (kidIx, par, (if kid.isNil: par[kidIx] else: kid))
    yield handle
  proc high(node: XmlNode): int = node.len - 1
  var done = false                    #flag to stop outer loop from inner
  var stack = newSeq[Handle]()
  var handle: Handle                  #Last node yielded
  recordYield nil.XmlNode, 0, node    #Always yield the root first
  if node.len == 0:
    done = true                       #Nothing left to do
  else:
    recordYield node, node.high       #Start from high node of root & proceed
  while not done:
    let kidHasKids =
      if handle.par.len - 1 < handle.kidIx: false   #kid@kidIx was deld post yield
      else: handle.kid.len > 0
    if kidHasKids:                    #Descend tree, build stack, yield last grandkid
      stack.add handle
      recordYield handle.kid, handle.kid.high
      continue
    while true:                       #Yield next (lower index) kid | maybe pop, find new
      if handle.kidIx >= 1:
        recordYield handle.par, handle.kidIx - 1
        break
      if stack.len > 0:
        handle = stack.pop()
      else:                           #No more nodes to try on the stack. Done.
        done = true
        break
Run

If you want some other visitation order such as low-to-high then you have to adjust things a bit. For in-order/work in the middle you might need to adjust it a lot. I actually kind of doubt it can be simplified if you want to preserve the deleteability of nodes pointed to by yielded handles. In these sorts of situations, recursive code is often drastically more clear. A good programming language should facilitate clear code. Therefore Nim iterators should grow recursion capability someday. But I'm not signing up to implement it, and I wouldn't assert it's a high priority.
Re: Arrays and Sequences in nim
As a point of clarification, depending a bit on OS and sysadmin settings, one can usually increase stack limits with the `ulimit -s unlimited` builtin for most shells, maybe `/etc/limits` or `/etc/security/limits.conf` on Linux, or even _adjustable in-program_ by `setrlimit` (if not blocked by admin settings). (Nim's stdlib could use an `RLIMIT_STACK` constant to ease that last one.) The default limits are usually much less than, say, system memory divided by the number of expected threads. It would be very rare that all threads used their limit anyway. When the limit is exceeded your thread/process just crashes, much as if you set the limit to be unlimited and used all system memory. The existence of low stack limit defaults historically mostly comes from systems with "spinning rust" rather than NVMe backing store for swap. On such systems, an unbounded recursion could bring the machine to a crawl for all users. So, the limit was a safety feature. These days, no swap at all is a real possibility. Even when swapping is enabled, the limit is more about not having some unbounded recursion flush filesystem caches forcing re-reads. Also, kernel code often has small stack limits for related but distinct reasons (non-swappability of kernel stack/page fault handlers/etc.). Kernel developers tend to set default limits based on what they think is "good for you" based on their own very stack-conservative styles. Anyway, the short of it is that if stack is how you would prefer to manage your memory (for whatever reasons) there isn't anything intrinsically wrong with that related to "size". There's just a little deployment complexity, like maybe erroring out if admins have blocked stack limit increasing and your `setrlimit` fails, which is atypical in my experience. (This is all Unix advice. I have little to no experience for how this plays out on Windows, but some more informed person can perhaps chime in.)
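In-program, it is just a couple of `posix` calls. A sketch (Linux-flavored; the `RLIMIT_STACK = 3` value is an assumption of mine since std/posix does not export that constant - verify it against your platform's headers):

import posix

const RLIMIT_STACK = 3.cint     # from <sys/resource.h> on Linux; check per platform

var rl: RLimit
if getrlimit(RLIMIT_STACK, rl) == 0:
  rl.rlim_cur = rl.rlim_max     # raise the soft limit up to the hard limit
  if setrlimit(RLIMIT_STACK, rl) != 0:
    stderr.write "setrlimit failed; an admin may have capped the hard limit\n"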
Re: nim android tutorial
Also, it's not an "app", but it bears re-iterating that Nim on Termux works just like Nim on a workstation/laptop/other posix-y environment. See, e.g., [https://forum.nim-lang.org/t/2891](https://forum.nim-lang.org/t/2891). I have a couple dozen Unix commands that compile and run just fine in both environments.
Re: Walking trees without recursive iterators
That sounds like a bug/limitation in @sschwarzer's algorithm to me, but maybe he can verify. Good catch, @e. Another pretty different approach that I use in `cligen/tern.nim` (that does not support in-iteration deletion) is something like this:

const NUL* = '\0'
type
  NodeOb*[T] {.acyclic.} = object
    ch*: char
    cnt*: int
    when T isnot void: val*: T
    kid*: array[3, ref NodeOb[T]]                     #0,1,2 ~ <,=,>
  Node*[T] = ref NodeOb[T]
  Tern*[T] = object     ## A Tern can be used as either a mapping from strings
    root*: Node[T]      ## to type ``T`` or as a set(strings) if ``T`` is void.
    count: int          ## Number of elements
    depth: int          ## Depth of Tree

iterator leaves[T](r: Node[T], depth: int, pfx=""): tuple[k: string, n: Node[T]] =
  type                                                #Nim iterators should grow
    Which = enum st0, st1, st2, stR                   #..recursion capability so
    State = tuple[st: Which, k: string, n: Node[T]]   #..this headache can be as
  if r != nil:                                        #..easy as `print`.
    var stack = newSeqOfCap[State](depth)
    stack.add (st: st0, k: pfx, n: r)
    while stack.len > 0:
      let state = stack[^1].addr
      if state.n == nil: break
      case state.st
      of st0:
        state.st = st1
        if state.n.kid[0] != nil:
          stack.add (st: st0, k: state.k, n: state.n.kid[0])
      of st1:
        state.st = st2
        if state.n.ch == NUL:
          yield (k: state.k, n: state.n)
        elif state.n.kid[1] != nil:
          stack.add (st: st0, k: state.k & $state.n.ch, n: state.n.kid[1])
      of st2:
        state.st = stR
        if state.n.kid[2] != nil:
          stack.add (st: st0, k: state.k, n: state.n.kid[2])
      of stR:
        discard stack.pop
Run

The above is probably pretty close to what compilers generate for recursive function calls. Of course, as mentioned I think things like this are more elegant/easy to understand:

proc print*[T](p: Node[T], depth=0) =     #2,1,0 gives std emoji-orientation
  if p == nil: return                     #..i.e. ":-)" head-tilt not "(-:".
  print(p.kid[2], depth + 1)
  echo strutils.repeat(" ", depth), cast[int](p), " ch: ", p.ch.repr, " cnt: ", p.cnt
  print(p.kid[1], depth + 1)
  print(p.kid[0], depth + 1)
Run

For `cligen/trie.nim`, since I was already using a super wasteful uncompressed trie, I just punted on all that and spent memory proportional to the answer, i.e.:

proc collect*[T](n: Node[T], key: var string, d=0, pfx="", i=0): seq[tuple[k: string, n: Node[T]]] =
  if n == nil: return
  if n.term: result.add (k: (if i > 0: pfx[0..
Re: Walking trees without recursive iterators
> What do you think?

I agree that at least a few others would appreciate some fully worked out general tree iterator example and I encourage you to undertake the challenge. A possibly helpful side-comment is that Python didn't really have recursive generators for the whole 1990s, but had a similarly nicely terse `for`-loop syntax for iteration. Then it grew `for x in sameItr: yield x` capability and finally `yield from`. So, there may be example code from back then to work off of. Yet another possibility, besides the several I mentioned and Araq's already mentioned visitor-with-a-callback idea, would be to define an object holding all the state you need with a `next` method. Then the iterator could just instantiate one of those and call the next method until done (a bare-bones sketch of that shape is below). It would also be nice to be able to iterate forwards as well as backwards. Another side-comment pertaining to @rayman22201's "parent node" idea is that while it can make tree navigation easier, for multi-way trees similar to your XML case, this can be undesirable. The reason is that when the parent changes, many nodes need to have their parent field updated. While one frequently sees red-black or AVL trees with a parent node, one almost never sees B-trees with a parent node due to that update cost. Of course, there's also the storage cost (N nodes with parents instead of a depth-proportional stack).
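That state-object-with-`next` shape, in bare-bones form (names invented; real traversal state would replace the toy `stack` field):

type
  Walker[T] = object
    stack: seq[T]               # whatever state the traversal actually needs

proc next[T](w: var Walker[T]): (bool, T) =
  ## Returns (false, default) once exhausted.
  if w.stack.len == 0: return
  result = (true, w.stack.pop)

iterator items[T](w0: Walker[T]): T =
  var w = w0                    # the iterator just drives `next` until done
  while true:
    let (ok, x) = w.next
    if not ok: break
    yield x

var w = Walker[int](stack: @[1, 2, 3])
for x in w: echo x              # 3, 2, 1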
Re: A path commonPrefix function
> Is there a nim equivalent to Python's timeit module for timing snippets?

There are probably about 100 variants in people's local code, but I happened to just see your message and also to just add this to `cligen/osUt` the other day (needs `requires "cligen#head"` in your `.nimble` file or you could just copy this tiny impl and adapt as you like):

import strutils, times

template timeIt*(label: string, unit: float, places=3, sep="\n", body: untyped) =
  let t0 = epochTime()
  body
  let dt = epochTime() - t0
  stdout.write label, " ", formatFloat(dt * unit, ffDecimal, places), sep

timeIt("write1", 1e6, 3, " ") : stderr.write "hi"
timeIt("write2", 1e6, 3, "\n"): stderr.write "there"
Run

You may want to add loops to repeat and use the `stats.RunningStat` stdlib API to do something nice more like Python's version. There is also this package called golden ([https://github.com/disruptek/golden](https://github.com/disruptek/golden)) to try to iterate until timing noise has been suppressed. And probably other various solutions, too. (And sometimes noise **is** the data and you want cold-cache/system competition effects included...benchmarking is obviously a larger topic than a snippet thing.)
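Riffing on the `RunningStat` suggestion above, one possible (hypothetical, not in cligen) repeat-and-summarize variant might look like:

import stats, times

template timeItStat*(label: string, reps: int, body: untyped) =
  ## Run `body` `reps` times; report mean +- sample standard deviation in usec.
  var rs: RunningStat
  for r in 1..reps:
    let t0 = epochTime()
    body
    rs.push((epochTime() - t0) * 1e6)
  echo label, ": ", rs.mean, " +- ", rs.standardDeviationS, " usec over ", reps, " reps"

timeItStat("concat", 100):
  var s = ""
  for i in 0..999: s.add 'x'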
Re: A path commonPrefix function
Are you aware of this binary search approach? [https://www.tutorialcup.com/string/longest-common-prefix-using-biary-search.htm](https://www.tutorialcup.com/string/longest-common-prefix-using-biary-search.htm)
Re: A path commonPrefix function
E.g., in Nim it might look like this:

proc minLen(strs: openArray[string]): int =
  result = int.high
  for s in strs: result = min(result, s.len)

proc check(strs: openArray[string]; j0, j1: int): bool =
  let srcSlc = strs[0][j0..j1]
  for s in strs:
    if s[j0..j1] != srcSlc: return false
  return true

proc longestCommonPfx*(strs: openArray[string]): string =
  if strs.len < 1: return ""
  let src = strs[0]
  if strs.len < 2: return src
  var lo = 0                        # Binary search
  var hi = strs.minLen - 1
  while lo <= hi:
    let mid = lo + (hi - lo) div 2
    if check(strs, lo, mid):        # All strs match this
      result.add src[lo .. mid]     # Append to answer
      lo = mid + 1                  # Go higher
    else:                           # Go lower
      hi = mid - 1

when isMainModule:
  import os
  echo longestCommonPfx(commandLineParams())
Run

There are also a lot of variants at [http://rosettacode.org/wiki/Longest_common_prefix](http://rosettacode.org/wiki/Longest_common_prefix). For paths you would just need to add something to back up to the most recent directory separator, if any.
Re: A path commonPrefix function
`result += mid + 1 - lo` is, indeed, way cheaper than `result.add`, but that binary search approach only does the `result.add` about `ceil(log_2(minLen))` times (i.e., like 3-10 times probably). I couldn't even measure the difference on my 3078 entry, average length 17, `/usr/bin/*` example (i.e. noise was >> difference). Regardless, I **100% agree** that separating into a prefixLen and then getting that slice would be a nicer factoring anyway. The caller may only need to test the length against `0` or something. Another algorithm of interest in terms of this whole thread is the "just compare first & last in sorted order" approach. This is very "tidy" from both a conceptual and coding point of view:

proc range*[T](xs: openArray[T]): tuple[min, max: T] =
  if xs.len < 1: return   # default vals for T
  result.min = xs[0]
  result.max = xs[0]
  for x in xs:
    result.min = min(result.min, x)
    result.max = max(result.max, x)

proc longestCommonPrefixLen*(strs: openArray[string]): int =
  let (min, max) = strs.range
  for i, c in min:
    if c != max[i]: return i
  return min.len
Run

As written, that's about 6x slower than the binary search approach for L2 resident data. It's probably possible to really speed up `range` for `openArray[string]` with some shallow copies in the loop and a real copy at the end. That might be a fun exercise for someone. Indeed, for very large inputs, bandwidth starvation of modern CPUs will mean that doing only one pass over said inputs is going to be best, even if it does quite a bit more CPU ALU-type work. So, even if that `range` cannot be sped up to match/beat the binary search in this context, one would still want to switch to this (or some other) one pass algo to accommodate giant inputs in a more general setting..probably past the L2 or L3 boundary (assuming it can run with minimal cache competition). @marks did not really discuss the size of his inputs or his use case, though his second "faster on his test set" is notably a multi-pass algorithm.
Re: A path commonPrefix function
> for s in paths[1 .. ^1]:

The above line really slows things down. Pretty sure it's creating a copy of that slice every outer loop. Change it to `for j in 1 ..< paths.len:` and use `s[j][i]` below. I still timed your version as about 1.5x the run time of my version. That residual was attributable to your `foldl` vs my `minLen`. It might be simpler to just use my version completely when adjudicating it. :-) Also, if you can, you should measure/say something about your expected total number of strings, average length (total data), and typical answer length. A less embarrassingly slow version of `range` for "expensive `T`" than my prior post is:

proc range*[T](xs: openArray[T]): tuple[min, max: T] =
  if xs.len < 1: return   # default vals for T
  var iMin, iMax: int     # zero init
  for i in 1 ..< xs.len:
    if xs[iMin] < xs[i]: iMin = i
    if xs[iMax] > xs[i]: iMax = i
  result.min = xs[iMin]
  result.max = xs[iMax]
Run

With this new version, the one-pass version is only about 1.5x the run-time of the binary search version on that easily L2 cachable 3078 string, avg len 17 bytes, prefix 9 running example of mine. Of course, as alluded to, the ratio of main DIMM bandwidth to L[12] cache bandwidth is almost always much greater than a small factor like 1.5..More like 4-5x. So, at some point the one pass range/string compare method will be fastest, but that point may not matter for your use case.
Re: A path commonPrefix function
And before someone corrects me, an even less embarrassing variant might be:

proc rangeAt*[T](xs: openArray[T]): tuple[minAt, maxAt: int] =
  if xs.len < 1: return (-1, -1)      # raise?
  for i in 1 ..< xs.len:              # note tuple ints init to zero
    if xs[result.minAt] < xs[i]: result.minAt = i
    if xs[result.minAt] > xs[i]: result.minAt = i
Run

with then `let (minAt, maxAt) = strs.rangeAt` and tracking `(min|max)` -> `strs[(min|max)At]` changes in the string comparison for the prefix length algo. Some paired-up value oriented interface for those who want it would also make sense:

proc range*[T](xs: openArray[T]): tuple[min, max: T] {.inline.} =
  if xs.len < 1: return   # raise?
  let (minAt, maxAt) = xs.rangeAt
  result.min = xs[minAt]
  result.max = xs[maxAt]
Run

in which case maybe no change to the prefix len algo is needed. While off the topic of the prefix calculation, it perhaps bears noting that SSE/AVX calculations can sometimes speed up the value-oriented versions of these range computations over arrays of CPU supported types like `openArray[float32]` by large factors - 24x even. Basically branchless and batching vector min/max instructions can be leveraged, but I am unaware of CPUs supporting "minAt" type vector instructions. Recent versions of gcc can recognize these min/max type value sweeps and correctly generate optimized code, but such pattern recognition can often fail as the code grows just a little more complex.
Re: A path commonPrefix function
It's a bit tricky to explain why your `foldl` would be much faster than my `minLen` (note "min" not "max"). Perhaps you aren't compiling with max optimization/`-d:danger` or something? Or perhaps some peculiarity related to your test data. Total data size, number of strings, and size of the found prefices are more salient statistics than how many times you repeat, though another possible explanation is that 100k repeats in a tight loop does something weird with a warmed up CPU branch predictor. My timings come from a Unix shell loop doing 10..100 repeated runs of the whole program. As with much benchmarking, it's debatable what is most representative. I'll include the whole program below for reference. I called @marks' algorithm `lcpVertical` for vertically-oriented scanning and swapped `i` & `j` to have more traditional `j=column index` notation and a slightly earlier exit. Because cligen allows unique prefixes, `./lcp -alcpb /usr/bin/*` is a valid way to run things in binary search mode. I don't do the final `rfind(sep)` work in any of those, but it's the same work for all the algorithms anyway and just a couple line wrapper `proc`. Anyway, I've run with a variety of inputs and things track my stated expectations perfectly (compiled with devel version of nim, d:danger, gcc-9.2, on Linux, i7-6700k). For really large inputs the range algo works best..for really small the binary search. So, an optimal-at-all-scales version would probably switch from binary search to range at some machine/system-load dependent threshold. stdlib-wise, the `range` LCP algo is never worse than about 2X the run time on my various /usr/xyz/* type tests, though. So, from a non-tuning/performance robustness point of view, it would be a better candidate for stdlib inclusion. Beyond the "one stop shopping" algo-wise and almost trivial implementation, its `rangeAt` & `range` helpers are much more basic/common operations than longest common prefix. Beyond that basicness, as mentioned, the value-based `range` may afford strong speed-ups for specialized types. So, it might be helpful to have a single point of reference that could be optimized/assumed optimized the way certain standard C library functions are, such as `memchr`.

when defined(foldl):
  import sequtils   # For me this is slower
  proc minLen(strs: openArray[string]): int {.inline.} =
    foldl(strs.mapIt(len(it)), min(a, b))
else:
  proc minLen(strs: openArray[string]): int {.inline.} =
    result = int.high
    for s in strs: result = min(result, s.len)

proc check(strs: openArray[string]; j0, j1: int): bool =
  let first = strs[0]
  for s in strs:
    for j in j0 .. j1:
      if s[j] != first[j]: return false
  return true

proc lcpLenBinSearch(strs: openArray[string]): int =
  if strs.len == 0: return 0          # maybe -1 instead?
  if strs.len == 1: return strs[0].len
  var lo = 0                          # Binary search
  var hi = strs.minLen - 1
  while lo <= hi:
    let mid = lo + (hi - lo) div 2
    if check(strs, lo, mid):          # All strs match this
      result += mid + 1 - lo          # Append to answer
      lo = mid + 1                    # Go higher
    else:                             # Go lower
      hi = mid - 1

proc rangeAt*[T](xs: openArray[T]): tuple[minAt, maxAt: int] =
  if xs.len == 0: return (-1, -1)     # raise?
  for i in 1 ..< xs.len:
    if xs[result.minAt] < xs[i]: result.minAt = i
    if xs[result.minAt] > xs[i]: result.minAt = i

proc range*[T](xs: openArray[T]): tuple[min, max: T] =
  if xs.len == 0: return              # raise?
  let (minAt, maxAt) = xs.rangeAt
  result.min = xs[minAt]
  result.max = xs[maxAt]

proc lcpLenRange(strs: openArray[string]): int =
  if strs.len == 0: return -1         # raise?
  let (minAt, maxAt) = strs.rangeAt
  for i, c in strs[minAt]:
    if c != strs[maxAt][i]: return i
  return strs[minAt].len

proc lcpLenVertical(strs: openArray[string]): int =
  if strs.len == 0: return 0          # simplified marks algo
  let first = strs[0]
  for j in 0 ..< first.len:
    for i in 1 ..< strs.len:
      if j >= strs[i].len or first[j] != strs[i][j]: return j
  return first.len

type lcpAlgo* = enum lcpBinSearch, lcpRange, lcpVertical

proc lcpLen*(strs: openArray[string], algo=lcpVertical): int =
  case algo
  of lcpBinSearch: return strs.lcpLenBinSearch
  of lcpRange:     return strs.lcpLenRange
  of lcpVertical:  return strs.lcpLenVertical

proc lcp*(strs: openArray[string], algo=lcpBinSearch): string =
  if strs.len < 1: return ""
  result = strs[0][0 ..< strs.lcpLen(algo)]

when isMainModule:                    # asserts are the Rosetta Code tests
  assert lcp(["interspecies", "interstellar", "interstate"]) == "inters"
  assert lcp(["thr
Re: A path commonPrefix function
Oops. Didn't survive some edits. Good catch as always, @e. Will fix in the prior post with a note. Does not affect my results, which are like this:

$ echo /usr/bin/*|wc
      1    3077   58306
$ (repeat 50; lcp -alcpV /usr/bin/*)|mnsd
9 in 53.97 +- 0.30 microseconds via lcpVertical
$ (repeat 50; lcp -alcpR /usr/bin/*)|mnsd
9 in 51.33 +- 0.42 microseconds via lcpRange
$ (repeat 50; lcp -alcpB /usr/bin/*)|mnsd
9 in 39.29 +- 0.78 microseconds via lcpBinSearch
Run

(I use Zsh which has `repeat` and a little mean-sdev computer.) Then for larger inputs:

$ echo /usr/lib/python3.6/**|wc
      1   39036 3108640
$ (repeat 10 lcp -alcpV /usr/lib/python3.6/**)|mnsd
19 in 2388. +- 14. microseconds via lcpVertical
$ (repeat 10 lcp -alcpR /usr/lib/python3.6/**)|mnsd
19 in 1131.1 +- 1.5 microseconds via lcpRange
$ (repeat 10 lcp -alcpB /usr/lib/python3.6/**)|mnsd
19 in 1523. +- 17. microseconds via lcpBinSearch
Run

The L3 on an i7-6700k is 8MB. So, this is showing the range method pulling ahead while still in L3 (3.1 MB). At one pass, it really has to be approximately the fastest at large scale. Reading from DIMMs, the difference should be even more pronounced in favor of range. It's possible that the marks/vertical scan style might be faster if the answer is very small. E.g., for a zero common prefix length, the answer will be concluded after just one pass down the first column via `lcpVertical`.
Re: Walking trees without recursive iterators
Well, this really does not address @spip's question at all, but it might be something @sschwarzer could use as an example for a multi-way/k-ary tree iterator. The iteration itself (not via `iterator`, but via `Path`) also should support deletion/insertion/other tree surgery. Maybe this time @e won't find any bugs. Famous last words. ;-)

type                                  #k-ary search tree (like a B-tree)
  NodeOb {.acyclic.} = object
    data*: seq[int]                   #.data.len > 0
    kids*: seq[ref NodeOb]            #.kids.len == .data.len + 1
  Node* = ref NodeOb                  #.. Also, .kids.len==0 => LEAF
  Path* = seq[tuple[p: Node, j: int]]

proc newNode*(data: openArray[int]): Node =
  result = Node.new                   #Convenience Node builder
  result.data.setLen(data.len)
  for i, x in data: result.data[i] = x

proc seek_most*(path: var Path, t: Node, lower=true) =
  if t == nil: return                 #Extend path from t to least|..
  var t = t                           #..|greatest-most kid per lower
  while t.kids.len > 0:
    path.add (t, if lower: 0 else: t.kids.len - 1)
    t = t.kids[path[^1].j]
  path.add (t, if lower: 0 else: t.data.len - 1)

proc seek_most*(t: Node, lower=true): Path =
  result.seek_most(t, lower)          #Convenience Path builder

proc seek_succ*(path: var Path) =     # path -> successor
  if path.len == 0: return            #Just for safety
  if path[^1].p.kids.len > 0:
    path[^1].j.inc
    path.seek_most(path[^1].p.kids[path[^1].j], true)
    return
  while path[^1].j == path[^1].p.data.len - 1:
    path.setLen(path.len - 1)         #discard path.pop
    if path.len == 0:                 #path @beg; No more succs
      return
    path[^1].j.dec
  path[^1].j.inc

proc seek_pred*(path: var Path) =     # path -> predecessor
  if path.len == 0: return
  if path[^1].p.kids.len > 0:
    path.seek_most(path[^1].p.kids[path[^1].j], false)
    return
  while path[^1].j == 0:
    path.setLen(path.len - 1)         #discard path.pop
    if path.len == 0:                 #path @beg; No more preds
      return
  path[^1].j.dec

iterator forward*(root: Node): tuple[node: Node; j, depth, val: int] =
  var path = seek_most(root, true)
  while path.len > 0:
    yield (path[^1].p, path[^1].j, path.len, path[^1].p.data[path[^1].j])
    path.seek_succ

iterator reverse*(root: Node): tuple[node: Node; j, depth, val: int] =
  var path = seek_most(root, false)
  while path.len > 0:
    yield (path[^1].p, path[^1].j, path.len, path[^1].p.data[path[^1].j])
    path.seek_pred

import strutils
proc `$`(n: Node): string = cast[uint](n).toHex
let spaces = repeat(' ', 80)

proc print*(n: Node, depth=0) =
  if n == nil: return
  for i in 0 ..< n.data.len:
    if i < n.kids.len: n.kids[i].print depth + 1
    echo spaces[0..<2*depth], $n, "(", i, "): ", n.data[i]
  if n.kids.len > 0: n.kids[^1].print depth + 1

var root = newNode([100, 200, 300])        # These values
root.kids.add newNode([20, 40, 60])        #..make this a
root.kids.add newNode([120, 140, 160])     #..full B-tree
root.kids.add newNode([220, 240, 260])
root.kids.add newNode([320, 340, 360])
root.print
echo ""
for (p, j, depth, val) in root.reverse:
  echo "R: ", spaces[0..<2*depth], $p, "(", j, "): ", val
echo ""
for (p, j, depth, val) in root.forward:
  echo "F: ", spaces[0..<2*depth], $p, "(", j, "): ", val
Run
Re: A path commonPrefix function
(One could probably inline the range work into the lcp and keep track of just the first character and exit early if it's ever different to combine the best properties...)
Re: A path commonPrefix function
For explicitness,

proc lcpLenVRange(strs: openArray[string]): int =
  if strs.len == 0: return -1   # raise?
  var minAt, maxAt: int
  for i in 1 ..< strs.len:
    if strs[i] < strs[minAt]: minAt = i
    if strs[i] > strs[maxAt]: maxAt = i
    if strs[i].len > 0 and strs[i][0] != strs[0][0]: return 0
  for i, c in strs[minAt]:
    if c != strs[maxAt][i]: return i
  return strs[minAt].len
Run

Integrating that into the test harness appropriately and running on my two running data input examples (extended to a case with zero length common prefix), and switching to `mnmx` (which gives the range of times instead of mean+-sdev) then gives:

$ for a in lcpVe lcpR lcpB lcpVR; { (repeat 20; lcp -a$a /usr/lib/python3.6/**)|mnmx }
19 in 2300.262..2448.32 microseconds via lcpVertical
19 in 1122.236..1162.767 microseconds via lcpRange
19 in 1523.256..1558.781 microseconds via lcpBinSearch
19 in 1189.47..1224.756 microseconds via lcpVRange
$ for a in lcpVe lcpR lcpB lcpVR; { (repeat 20; lcp -a$a /usr/bin/*)|mnmx }
9 in 51.737..76.532 microseconds via lcpVertical
9 in 48.637..68.903 microseconds via lcpRange
9 in 36.24..40.054 microseconds via lcpBinSearch
9 in 46.253..54.598 microseconds via lcpVRange
$ for a in lcpVe lcpR lcpB lcpVR; { (cd /usr/bin; repeat 20; lcp -a$a *)|mnmx }
0 in 3.099..5.722 microseconds via lcpVertical
0 in 43.392..80.109 microseconds via lcpRange
0 in 8.583..13.113 microseconds via lcpBinSearch
0 in 3.099..5.245 microseconds via lcpVRange
$ for a in lcpVe lcpR lcpB lcpVR; { (cd /usr/lib/python3.6; repeat 20; lcp -a$a **)|mnmx }
0 in 7.868..11.683 microseconds via lcpVertical
0 in 911.713..959.158 microseconds via lcpRange
0 in 174.522..190.735 microseconds via lcpBinSearch
0 in 9.537..14.782 microseconds via lcpVRange
Run

So, you can get the early exit of the vertical method for zero common prefices in the range method without very much extra cost. E.g., slightly negative/in-the-noise cost for the L2 prefixLen==9 case and 6% slower for the prefixLen==19 L3 case, with the early exits basically matching times in the two prefixLen==0 cases. (and, yeah, yeah..the Nim program should `import stats`, do the repeated loops internally updating a `RunningStat`, and report. It could even take an `lcpAll` enum to report on all algorithms..and -- if one wants to commit to only file tree inputs -- maybe even take just a directory as a command parameter, `import oswalkdir`, and use `walkDir` and `walkDirRec` instead of `*` and `**` to remove dependency on shell features.)
Re: A path commonPrefix function
Things in that last version weren't robust/efficient with regards to zero length strings being in the mix. This fixes that:

proc lcpLenVRange(strs: openArray[string]): int =
  if strs.len == 0: return -1        # raise?
  if strs[0].len == 0: return 0      # ANY "" anywhere => 0
  let byte0 = strs[0][0]
  var minAt, maxAt: int
  for i in 1 ..< strs.len:
    if strs[i].len == 0: return 0
    if strs[i] < strs[minAt]: minAt = i
    if strs[i] > strs[maxAt]: maxAt = i
    if strs[i][0] != byte0: return 0
  for i, c in strs[minAt]:
    if c != strs[maxAt][i]: return i
  return strs[minAt].len
Run

In the interests of being mostly self-contained, I also pushed a better `timeIt` to `cligen#head` and updated my driver to use the new `timeIt` and to **not** time the `stdout.write` parts:

proc timeAlgo(algo=lcpVRange, reps=1, strs: seq[string]) =
  var x: int
  timeIt(stdout.write, "", 1e-6, 3, " usec via " & ($algo)[3..^1], reps):
    x += strs.lcpLen(algo)
  echo " ", int(x.float / reps.float + 0.5)
Run

Finally, I created a `benchLcp.zsh` script:

#!/bin/zsh
lcp=`pwd`/lcp            #We need Zsh only for its **
echo "L3 data, 19 answer"
for a in lcpVe lcpR lcpB lcpVR; do $lcp -r10 -a$a /usr/lib/python3.6/**; done
echo "L2 data, 9 answer"
for a in lcpVe lcpR lcpB lcpVR; do $lcp -r10 -a$a /usr/bin/*; done
echo "L3 data, 0 answer"
for a in lcpVe lcpR lcpB lcpVR; do (cd /usr/lib/python3.6; $lcp -r10 -a$a **); done
echo "L2 data, 0 answer"
for a in lcpVe lcpR lcpB lcpVR; do (cd /usr/bin; $lcp -r10 -a$a *); done
Run

and used a little `nim-pgo` wrapper I have to use gcc's profile guided optimization (`-fprofile-generate` and `-fprofile-use`), compiling, running a test/bench, then re-compiling. With all that, I get:

L3 data, 19 answer
1910.448..2253.771 1959.491 +- 32.812 usec via Vertical 19
686.884..1150.131 777.125 +- 48.653 usec via Range 19
922.680..1522.064 996.113 +- 58.569 usec via BinSearch 19
693.560..1163.244 786.591 +- 49.932 usec via VRange 19
L2 data, 9 answer
41.246..63.181 45.848 +- 2.480 usec via Vertical 9
37.193..54.359 40.936 +- 1.778 usec via Range 9
22.650..25.511 23.413 +- 0.353 usec via BinSearch 9
36.716..55.075 41.509 +- 2.292 usec via VRange 9
L3 data, 0 answer
0.000..0.954 0.167 +- 0.094 usec via Vertical 0
535.488..1017.570 627.780 +- 48.740 usec via Range 0
58.651..205.994 82.874 +- 15.300 usec via BinSearch 0
0.715..3.576 1.097 +- 0.278 usec via VRange 0
L2 data, 0 answer
0.000..0.000 0.000 +- 0.000 usec via Vertical 0
32.663..38.862 33.927 +- 0.650 usec via Range 0
3.099..4.768 3.290 +- 0.166 usec via BinSearch 0
0.000..0.238 0.048 +- 0.032 usec via VRange 0
Run

which makes total sense to me. The @marks performance discrepancies (except his mysterious `foldl` bit) probably just come from the "mix"/averaging of how many zero length answers are in his inputs. So, this new variant of that VRange hybrid method would make the most sense to recommend for general usage (of the above selection, anyway). It's never that slow, fastest at large scale when performance matters most, and can leverage early exit in easy cases. The binary search way for smaller scale data may be best if the caller already knows and can pass in a data scale parameter. If we must measure the data size, that takes a whole pass through the length fields. That measurement eats into the 36/22=1.6x L2-non-zero speed ratio, probably reducing it to like 1.3x, which isn't that much of a boost for all the complexity of measurement+algorithm switching. I got about a 1.15x boost from PGO and most would probably skip that entirely to avoid build complexity.
Re: Introducing --gc:arc
On Linux 5.4.2, `gcc 9.2.0 -march=native -O3 -fno-strict-aliasing` , head of devel (...ea39524bac14f6c89de5213509f4099) with a similarly tweaked run-5-times variant (using `epochTime` instead of `getTicks`), I see no discrepancy in elapsed time in `--gc:arc -d:release` and `--gc:arc -d:danger`. It does seem maybe more variable with only `-d:release`, but I have not studied it very systematically. I get numbers like 0.07 milliseconds per 1 loop of the 5 both ways..min == max, even. It does vary a lot from run-to-run of the program though, with both release & danger modes. I suspect we need a more reliable way to benchmark before being too worried.
Re: Introducing --gc:arc
One other thing I can say that may be helpful is that when you do the "max time of N samples" you shift the probability distribution (not density) by an _exponent_ of N. See [https://en.wikipedia.org/wiki/Extreme_value_theory](https://en.wikipedia.org/wiki/Extreme_value_theory) for details. Basically, if the latency has distribution F(t), the worst of 5 has distribution F(t)**5. The median is where a distribution hits 0.5. So, the median of the worst of 5 sits at the 0.5**(1/5) ~= 87th percentile of the base time distribution (equivalently, the worst of 5 exceeds the base median with probability 1-0.5**5 = 96.875%). Similarly, the median of the min time of 5 runs would be near the 1-0.5**(1/5) ~= 13th percentile of the base time distribution. So best/worst of 5 you expect (in a median sense) to sample the lower/upper tail of the base distribution. A lot of the time in benchmarking one takes the min of N to filter out all the noise from other things going on in the system. Unlike many things in statistics these results do not rely on anything about the shape of F(t)..It's just basic probability theory. What this means in this benchmark example is that it is going to be very sensitive (by construction if not intent) to **a lot** of competing activity on the system - network interrupts, competing processes, OS kernel goings on, L2/L3 cache evictions, etc. So, it is "noisy by design". The bigger N the noisier the max will be. For example, if you have a Winchester hard drive on the system and you make N big enough you will sample some suspension of the process, CPU migration, total re-warm up of L2 cache, and however long the disk service takes. None of that really relates _directly_ to "memory manager performance". It's more measuring how loaded your system is. Anyway, to me this explains the high run to run variability/difficulty of interpretation/difficulty to reproduce here. The 87th percentile is pretty high and that's just the median. The tail of F(t)**5 is even worse than the median. And those tail-tail events are ever less relevant to comparing memory managers. To the extent they are of interest, because they are so very noisy, it's going to require a lot of sampling to assess them. The truly worst-worst-worst cases where everything is cold cache are actually pretty hard to measure for a variety of reasons. A lot of the time "hot loop benchmarks" can be **_very_** misleading about worst case latency and even pros mess this up _all the time_. I could provide a couple of fun examples sometime. Memory is mostly about DIMM/L3/L2/L1 latencies and bandwidths, though. You can measure those and reason from there more by looking at code and "calculating" from those measured constants than by timing it. Often cold cache DIMM hits are the most important thing if you aren't moving a lot of memory around. Then number of copies if you are moving. { This is why I like linear probing hash tables (with|without only local Robin Hood reorg which can help insert-heavy workloads). You just know it's basically just one 64-128 byte cache fill for really big tables when it's most painful, and you can't do better than 1. }
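For the curious, the one-line order-statistics derivation behind those percentages (nothing Nim-specific):

P(\max(t_1,\ldots,t_N) \le t) = F(t)^N
\;\Rightarrow\; \mathrm{median}(\max_N) = F^{-1}\!\bigl(2^{-1/N}\bigr) \approx F^{-1}(0.871)\ \text{for}\ N=5,
\qquad P\bigl(\max_5 > \mathrm{median}(F)\bigr) = 1 - 2^{-5} = 0.96875 .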
Re: Introducing --gc:arc
TL;DR - min of N is bounded below by the true, necessary time to do some calculation and can substitute for a "best case analysis". On the other side, max of N times on a multi-tasking system is not bounded by anything and polluted by everything. The upper tails mostly measure things that probably "do not generalize" to other systems (or even the same system at other times). max of N times cannot substitute for worst case analysis and mostly creates confusion which I suspect is the case here.
Re: Introducing --gc:arc
Also, I just noticed that the `worst` in Araq's benchmark is the "max of `msgCount` == 1 million trials", effectively. @miran did not mention if he was careful to re-initialize `worst` to zero in his "slightly modified" benchmark either. So, it might be more like 5 million. As should be clear from above, the max of millions is just begging to capture system noise not the effect in question, and I would not expect very consistent results over runs/systems/or even over time on the same system. Here is a perhaps interesting example of latency measurement. The basic task is measuring memory latency. One might **_think_** just accessing a random element of a large array that fits in available RAM is a way to do this, but one would be **_wrong_** at least in a "hot loop" setting. Anyway, not that it's rocket science, but the program also exhibits a way to measure some low quantile like min of N (itself a statistical estimate and so in this case averaged/sdev'd):

import random, times, strutils, stats, cligen

proc prepRanElt(x: var seq[int], n: int) =
  for i in 0 ..< n: x[i] = rand(9)            #9 keeps sums short

#Optimized for speed not randomness to maximize delta
var r = initRand(123)

proc runRanElt(x: seq[int], nAcc: int): int =
  let mask = x.len - 1                        #XXX (1 shl floor(lg(x.len-1
  for i in 1..nAcc: result += x[r.next.int and mask]

proc prepShuffle(x: var seq[int], n: int) =
  for i in 0 ..< n: x[i] = i                  #Pop seq with identity
  x.shuffle                                   #..perm & then shuffle

proc runShuffle(x: seq[int], nAcc: int): int =
  for i in 1..nAcc: result = x[result]

proc fmt(x=0.0, n=3): auto = formatFloat(x, ffDecimal, n)

proc time(prep, run: auto; n, nAcc, avgN, minN: int) =
  var dtMins: RunningStat
  var s = 0                                   #Block optimizing away all work by use
  var x = newSeq[int](n)
  x.prep n
  for avgIt in 1..avgN:
    var dtMin = float.high
    for minIt in 1..minN:
      let t0 = epochTime()
      s += x.run(nAcc)
      let dt = (epochTime() - t0) * 1e9 / nAcc.float
      dtMin = min(dtMin, dt)
    dtMins.push dtMin
  echo "KiB: ", n shr 7, " ns/Access: ", fmt(dtMins.mean),
       " +- ", fmt(dtMins.standardDeviationS), " s:", s

type Algo = enum ranElt, shuff

proc lat*(kind=shuff, sizeKiB=1048576, nAcc=1_000_000, avgN=4, minN=4, seed=0) =
  ## Time latency two ways. One does NOT measure it!
  if seed > 0: r = initRand(seed)
  else: randomize(); r = initRand(rand(10))
  let n = (sizeKiB shl 10) div int.sizeof
  if kind == shuff: time(prepShuffle, runShuffle, n, nAcc, avgN, minN)
  else: time(prepRanElt, runRanElt, n, nAcc, avgN, minN)

dispatch(lat, help={"kind": "shuff: chase ran perm\n" &
                            "ranElt: access ran elt",
                    "seed": "0=>random, else set"})
Run

I get output like this for (1 shl 20) KiB == 1 GiB working set:

shell$ memlat -kr; memlat -ks
KiB: 1048576 ns/Access: 8.197 +- 0.002 s:71968628
KiB: 1048576 ns/Access: 67.330 +- 0.078 s:1295362736
Run

I am pretty sure that what is going on is that the work to make the next pseudorandom number and the subsequent cache line load is predictable. So, speculative execution is letting the CPU "work ahead" of the memory system by many loops, and its coordinated prefetching then masks the memory system latency quite well, yielding a kind of "amortized/fake latency". Meanwhile the CPU cannot predict the bouncing-randomly-around-the-array loads because the next hop isn't known until the last hop is done. So, it has to actually wait on the DIMMs and you get to see the real round-trip latency. This may all seem "off track/topic", but many benchmarks have the pattern of the former rather than the latter and then fail to represent real world performance, creating confusion.
For example, hash table benchmarks often look _just_ like the above random element case. One example of pros getting that wrong, Table 1 in Kraska 2017 ([https://arxiv.org/abs/1712.01208](https://arxiv.org/abs/1712.01208)) has crazy good numbers like 31 nanoseconds for hash lookup latency. I don't think anyone makes DIMMs with anything **_near_** that kind of latency and if those were being used it would require special mention (not that the evaluation sections of those authors ever stray too close to reproducibility or analyzability...). Anyway, that 31 ns is almost surely amortized/fake latency. I think even all the blogger follow-up to Kraska2017 failed to observe this particular issue. Hardware manufacturers worsen this whole situation by lying about/quoting amortized instead of true round trip latency (what "latency" has meant my whole life anyway). Another comment about that benchmark code is that even putting `rdtsc`/epochTime/e
Re: Introducing --gc:arc
One other thing you might do, and I hesitate to make this recommendation because it's still a decidedly difficult to reproduce scenario, is to take the min of 100 trials of the max of 1 or something like that. The trick there is making sure the "time of interest" for that max of 1 is short enough that the min has a > 1% chance of filtering out all extraneous "unwanted" system activity that might be sampled. There is other pressure wanting to make that time long - such as being able to probe L3/DIMM-resident behavior of things. You might not be able to get even a 1% chance of a quiescent system for the duration of that 1 trial thing. On unix, you could shutdown the window system/all the services and say, be in single user mode..Maybe even unplug the network cable and definitely don't make any keystrokes after you press ENTER. Presumably what you are after is "Worst case-ness of just GC related things". That might be a start to a measuring mode of assessing it instead of just "pondering". I'm not sure it's much better than pondering, though, and definitely deserves fair warning to would be reproducers.
Re: Introducing --gc:arc
Some guy wrote his own "measurement OS" for studying CPU behavior in light of Spectre type information leak vulnerabilities. I forget the name of it at the moment. I do not think he open sourced it, though. Anyway, if you don't have something like that handy and you don't want to chase performance ghosts, what I would recommend is an L-average of the M-min of the N-max (basically like my `memlat` code) where you tune the loop sizes to get small standard deviations in the outermost, and N is probably adjusted based on L1/L2/L3/DIMM levels. If you use the smallest max loop that can measure what you're after, that will "multiply", meaning smaller min and averaging loops will get you stable numbers. Even in single user mode with no network cable, **_and_** with low standard deviations in the outermost loop, depending on how many milliseconds the max loop takes, you might still be measuring "time you want" + "a not-varying-much scheduled-time/L3-sharing impact from however many Linux kernel threads can line up against my program at once during my max loop". You may be able to use `chrt -r 99 myprogram` or `taskset` to minimize that. Chances are that if the inner max loop is pretty short lived, "not varying much competitors" is unlikely, and so a small standard deviation is unlikely to be a false flag. Past that step, there is also a way to boot a Linux kernel with some CPU cores "taken out of scheduler availability" (`isolcpus`) and then when you use `taskset` to assign some process to this isolated core, that is the only thing that runs on that core, period. That probably reduces the "what am I measuring?" uncertainty to virtual memory system and L3 sharing impacts. I doubt many/any Linux kernel threads are very VM/L3 intensive when in single-user mode. So, that's probably about as close as you could get, and actually probably pretty close. It's admittedly all a bit of a PITA, but probably less so than buying a real-time OS or writing your own. :-) Anyway, maybe everyone knew all this already, but it seemed worth saying in light of that code @miran linked to.