Re: Dictionary syntax

2020-06-24 Thread cblake
In other words, it seems "excessive" to you because you cannot imagine ever 
using anything but the stdlib `Table` "behind" `{}`. Never switching 
out something like that is "famous last words" in some circles. :-) So, the 
verbosity guards against the regret of hard-coding.


Re: Dictionary syntax

2020-06-24 Thread cblake
The `{}` syntax is what the Lisp world calls an "association-list" and is more 
general than what the Python world calls a dictionary literal/constructor 
syntax. In Nim `{a:b,c:d, ...}` is just syntactic sugar for `[ (a,b), (c,d), 
... ]` (or `@[]` if you prefer a seq analogy, but it's at compile time when 
either seq or array lengths are "known").
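
For instance, this much compiles today (a minimal sketch, assuming only the stdlib `tables` module):


import tables

let pairs = {1: "one", 2: "two"}           # just an array of (key, value) tuples
doAssert pairs == [(1, "one"), (2, "two")]
echo pairs.toTable                         # pick the stdlib hash table back-end
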

This is good in that you can use the same syntax for multiple back-end 
associative array implementations (like a simple sorted array, trees, alternate 
hash tables, etc.). In a sense the syntax is only loosely coupled to the 
semantics here. The cost is the slight extra verbosity of tacking on a 
`.toTable`, `toSortedArray`, `.toBST`, `.toBTree` or whatever which is deemed 
an acceptable trade off for the generality. Nim is not Python and you do have 
to put types/type-related things in various places.


Re: NvP: s.add('x') 100M times

2020-06-24 Thread cblake
For this particular benchmark `--gc:boehm` uses the least memory and time for 
me on nim 28510a9da9bf2a6b02590ba27b64e951a208b23d with gcc-10.1 and PGO but 
that least is still 2.5x the RSS of python-2.7.18. Not sure why, but yeah it is 
35x faster than Python.
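
For anyone reading this without the original thread: the `str1` program under test is presumably just the naive loop, roughly:


var s = ""
for i in 0 .. 100_000_000: s.add('x')
echo s.len
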


Re: NvP: s.add('x') 100M times

2020-06-24 Thread cblake
I don't disagree. Might need delving into the generated C to figure out, but 
I'm guessing my results are not hard to reproduce. If they are let me know how 
I can best help.


Re: NvP: s.add('x') 100M times

2020-06-24 Thread cblake
Just did a non-PGO regular `-d:danger` run. Times went up 1.9x but memory usage 
patterns were the same with `gc:arc` using much more RSS than `gc:boehm` or 
`gc:markAndSweep`. It's a pretty tiny program.


Re: NvP: s.add('x') 100M times

2020-06-24 Thread cblake
@cumulonimbus - I tried that. Didn't alter the behavior I was seeing.

If this behavior was not always there then my guess is that some arc bug was 
causing a crash, got fixed, and now the fix causes this. Regardless of whether 
it was always there or appeared by bug-jello-squishing accident as I theorize, 
we should probably have a little suite of "memory use regression" tests to 
prevent stuff like the scenario I described. Such a suite would be a kind of 
"correctness testing" for deterministic memory management. Could have a 
"fuzzy/ball park compare".

Maybe we have such already, perhaps informally? If so, we should add this 
`str1` to it. If not, it can be the first test. :-)
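
Just to sketch what such a fuzzy/ball-park compare could look like (nothing like this exists yet; GNU time's `-f %M` peak-RSS-in-kB output and the 300 MB threshold here are assumptions):


import osproc, strutils

proc maxRSSkB(cmd: string): int =
  ## Run `cmd` under GNU time and return its peak RSS in kB.
  let (output, _) = execCmdEx("/usr/bin/time -f %M " & cmd & " >/dev/null")
  result = output.strip.splitLines[^1].strip.parseInt

when isMainModule:
  let rss = maxRSSkB("./str1")
  doAssert rss < 300_000, "str1 peak RSS regression: " & $rss & " kB"
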


Re: NvP: s.add('x') 100M times

2020-06-24 Thread cblake
Yup. Just what I was seeing, @b3liever. No `main()`-difference to the RSS 
delta, and a very noticeable delta in a non-intuitive direction. So, either our 
intuitions are wrong in a way which should be clarified or there's a problem 
which should be fixed. Maybe a github issue?


Re: NvP: s.add('x') 100M times

2020-06-24 Thread cblake
I don't see how your linked algo explains deltas across gcs if that `3 div 2` 
growth happens for all of them. The memory `gc:arc` uses here seems more like 
the sum of all prior allocations, not "up to 1.5x what's needed". Actually, 
`gc:arc` uses 1.25x more mem than `gc:none` (312MB) in a test I just tried.
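
For reference, here is roughly what "the sum of all prior allocations" amounts to under a `3 div 2` growth policy (a sketch; the real allocator rounds sizes and adds constants):


var cap = 1
var total = 0
while cap < 100_000_001:            # grow until a 100M char string fits
  total += cap
  cap = max(cap + 1, cap * 3 div 2)
echo total, " bytes touched in total vs ", cap, " final capacity"   # ~3x final
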


Re: NvP: s.add('x') 100M times

2020-06-24 Thread cblake
I also usually find Py2 much faster than Py3. Pypy usually helps. Cython more 
so, but with much more work.

Anyway, the obvious better way to do it in Nim (which I always assumed was 
"never the point") is 


var s = newStringOfCap(100_000_001) # or whatever
for i in 0..100_000_000: s.add('x')
echo len(s)



which runs 2x as fast as otherwise and uses exactly the right amount of memory. 
I mention it just in case @HashBackupJim was unaware.


Re: NvP: s.add('x') 100M times

2020-06-24 Thread cblake
@HashBackupJim - `newSeqOfCap[T](someLen)` also exists and, yes, pre-sizing 
can help a lot in Nim (and almost any language that supports it).
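
E.g., the `seq` analogue of the earlier `newStringOfCap` example would be something like:


var s = newSeqOfCap[int](12_500_001)   # capacity hint; len starts at 0
for i in 0 .. 12_500_000: s.add 1
echo s.len
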

Profile-guided optimization at the gcc level can also help Nim run timings a 
lot - in this case 1.6x to 1.9x for various `gc` modes. 
[https://forum.nim-lang.org/t/6295](https://forum.nim-lang.org/t/6295) explains 
how. LTO also helps since most of the boost of PGO is probably from well chosen 
inlining.

@sschwartzer - not only string benchmarks... interpreter start-up, etc. Anyway, 
this isn't a Python forum, and benchmarking "always depends". :-) :-)

Someone else should reproduce my `--gc:arc` uses more memory than `gc:none` for 
the original str1.nim or one with a `main()` (or both). I think this kicking 
the tires has probably uncovered a real problem.


Re: NvP: s.add('x') 100M times

2020-06-25 Thread cblake
The same problem happens on nim-1.2.0 (well, nim-devel git hash 
ed44e524b055b985b0941227a61f84dd8fdfcb37). So, this is a long-lived, maybe 
since the beginning behavior of `gc:arc` (but we should still have a memory 
overuse regression test). Probably time to look at generated C.


Re: NvP: s.add('x') 100M times

2020-06-25 Thread cblake
One other Nim-level thing I can say is that things work as expected for 
`seq[int]` of the same memory scale (100MB). I.e., 


proc main() =
  var s: seq[int]
  for i in 0..12_500_000: s.add 1
  echo len(s)
main()



produces a memory report (using /usr/bin/time on Linux) like: 


187MB seq2-arc
250MB seq2-default
250MB seq2-msweep
265MB seq2-boehm
300MB seq2-none



So, this problem is **_only_** for Nim `string`. Indeed, if one changes 
`string` to `seq[char]` in the original example, usage goes down to 139MB, 
roughly what one would expect for a 3/2 growth policy.


Re: NvP: s.add('x') 100M times

2020-06-25 Thread cblake
I was mistaken. I was compiling my `seq` test with `-d:useMalloc`, which fixes 
the problem. Sorry - fiddling with too many knobs.

`string` and `seq[char]` behave identically with `gc:arc` and both get fixed 
(139MB) with `--gc:arc -d:useMalloc`. Other GCs (including `none`) still beat 
`gc:arc`-without-`useMalloc` on `string`. However, other gc's spend more memory 
(like 420MB) than `gc:arc` on `seq[char]`. So, whatever is going on, `seq` is 
actually worse than `string`, not better. { But also a side note for 
@HashBackupJim to try `-d:useMalloc` with `--gc:arc`. }

At this point, I should raise an issue over at Github. I'll link it back here.


Re: NvP: s.add('x') 100M times

2020-06-25 Thread cblake
Github issue: 
[https://github.com/nim-lang/Nim/issues/14811](https://github.com/nim-lang/Nim/issues/14811)


Re: Why does wrapping the code in a top level procedure make it faster?

2020-07-03 Thread cblake
@Stefan_Salewski - The idea of defaulting to an optimizing mode has come up 
before. It seems contrary to what almost any user coming from a "compiled 
language" world would expect. For various reasons (debugability, compile-time 
performance, etc.), almost all compilers default to a non-optimized (or very 
weakly optimized) output and allow various knobs to crank up optimization, as 
does the current `nim` setup.

There is even a whole dark art of "best optimization flags" which can be taken 
[to severe extremes](https://github.com/Acovea/libacovea). More 
simply/efficiently/some might say intelligently, you can often use PGO 
[https://forum.nim-lang.org/t/6295](https://forum.nim-lang.org/t/6295) to get 
1.25..2.0x boosts on object code generated from nim-generated C. Some flags 
like `-ffast-math` can even change semantics in subtle ways that can impact 
correctness or not depending on use cases.

I don't know what to do about people publicizing misleading benchmarks. That 
seems an ineliminable hazard, not only for Nim, but actually **_everywhere and 
all the time_** , and often not on purpose (though most "marketing" wants to do 
strawman comparisons on purpose). Besides compiling wrong, they could also use 
a bad algorithm, misrepresentative inputs, weird machines, benchmarks naive 
relative to intended measurements, and probably several other mistake 
categories. :-/ The best strategy may be trying to be supportive where & when 
we can, educating as we go, though I agree/admit that is a neverending battle.


Re: Why does wrapping the code in a top level procedure make it faster?

2020-07-03 Thread cblake
Ah. Sorry that I misunderstood the sarcasm. Maybe my content/xref was not 
worthless anyway. ;-)


Re: Why does wrapping the code in a top level procedure make it faster?

2020-07-03 Thread cblake
As the person who added `rightSize`, I agree with those points, actually. The 
only slight gotcha is that doing as you suggest would mean old code that 
already calls it with the correct power of two to avoid a resize (or maybe 
already with a `rightSize` result) would allocate the _next higher_ power, 
wasting space. That might be a price worth paying, ease-of-use-wise.

We _could_ call `rightSize` only if the argument is not a power of two, but 
it's probably simpler to always use `rightSize` and just tell people. 
Wasting space isn't quite a "failure". Anyway, you should feel free to file a 
GitHub issue and/or work on a pull request. Maybe I should have pushed harder 
at the time to change the parameter semantics.
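
For reference, the current usage pattern being discussed is roughly:


import tables

# rightSize maps a desired element count to the internal power-of-two size,
# so no resize happens while inserting ~4096 items.
var t = initTable[string, int](rightSize(4096))
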

There is definitely a learning curve coming from other languages. There are 
definitely more examples of the stdlib not providing the "best in class" for 
various functionalities. There will likely be even more with time. Nim core 
only has so much manpower to write/maintain/support such things.

We discussed doing a "distribution" over at 
[https://github.com/nim-lang/RFCs/issues/173](https://github.com/nim-lang/RFCs/issues/173)
 and elsewhere. This "fusion" repository (see end of that thread) seems to have 
been the output, but not much has been happening over there.

Anyway, I think if you persevere there are good chances you could be a happy 
Nimmer once you've learned your way around. You seem pretty capable of finding 
these issues and you could probably help round some of the sharp edges, if you 
can sustain the patience. There is probably a "Porting to Nim from Python" 
guidebook out there already - or maybe you could help write one!


Re: Why does wrapping the code in a top level procedure make it faster?

2020-07-03 Thread cblake
In my experience batteries are almost never fully charged and it's hard to get 
feedback if you only release a perfect cathedral. With no feedback, it's kind 
of unlikely you will build something popular.

To take a topical example, even after 30 years of tuning, the core language 
`dict` in Python still seems to have no easy way to "pre-size" an instance. 
There is no power of 2, no `rightSize`, nuthin'. So, one of your rough edges 
literally cannot arise due to an inflexibility/API design flaw (IMO).

Yes, there must be 3rd party replacements or ways to pre-size in some slow 
subclass or whatever, but you could also just write a `proc initTab` that 
always calls `rightSize` for your own code. What's at issue here is "out of the 
box" (and personally I think the Nim workaround is less onerous than 
workarounds in the Python world).
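
Such a wrapper is only a couple of lines (a sketch; `initTab` is just a made-up name):


import tables

proc initTab[A, B](expected = 32): Table[A, B] =
  ## Like initTable, but takes an expected element count, not a power of two.
  initTable[A, B](rightSize(expected))

var t = initTab[string, int](1000)   # no rehashing while adding ~1000 items
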

Do you have to learn your way around? Sure. A sales pitch of "just like Python 
but compiled/with native types" is definitely not quite right. That's 
Cython/similar. Analogies and oversimplifications help some people while others 
want the real story, and you strike me as the latter.

Nim gives programmers more power and flexibility, but responsibilities also 
come with that. Cue Spider Man's Uncle Ben. ;-) It is yet a comparatively tiny 
community. Bugs, workarounds, and rough edges come with that. Unused patterns 
like your `&` are just unused, unoptimized things waiting to be 
discovered/fixed. There was a time when `C` had no proper namespacing of struct 
members and that is why some core Unix types have prefixes like the `st_*` in 
`struct stat` (run "man 2 stat" on any unix machine).

No one using Nim _wants_ it to be hard, but there may also be good reasons some 
things are the way they are (often flexibility, performance, or safety, but yes 
sometimes neglect). I'm really sorry to hear your entry has been tough, but the 
flip side of that is you could probably make a HUGE impact to future 
similar-but-not-quite-you's, and I again encourage you to try! Even just 
documenting everything somewhere would probably help at least a few other 
Python Refugees. :-) Cheers and best wishes whatever you decide.


Re: Introducing --gc:arc

2020-07-05 Thread cblake
You should also try PGO both with gcc and clang. 
[https://forum.nim-lang.org/t/6295](https://forum.nim-lang.org/t/6295) { yes, I 
know you posted that. :-) xref for others. :-) }


Re: Python transpiler

2020-07-13 Thread cblake
@ShalokShalom can perhaps clarify what he meant, but I took him to be asking 
for a `c2nim`-like program `python2nim` to help facilitate porting pure Python 
code to pure Nim code (as opposed to using already written Nim code from within 
Python or already written Python code from within Nim). Some runtime libraries 
to bridge API differences might also help along those lines. Some quick web 
searches turn up a few partial moves in this direction.


Re: Python transpiler

2020-07-13 Thread cblake
I think part of the "sales pitch" for Nim out in the world, for good or ill, is 
"sorta like Python but compiled with static types". It isn't so strange that 
people show up asking how to automate "sorta like" into "more so". ;-)

The vast majority of Python code that I have seen uses precious little of 
Python's very dynamic nature. So, doing a half-baked py2nim with a **_slew_** 
of caveats might have some value. It's not like `c2nim` is perfect, either. The 
questions are more A) if the caveats are beyond the grasp of the typical person 
porting such code and B) if the rather open-ended nature (ever changing Py3) 
would leave it half-done forever. @juancarlospaco may be right that a group 
effort here could yield the best results.

It isn't my question to answer, but as to "why?", it is likely "easier good 
performance". Python unassisted by C/Cython/Pythran/etc. or Pypy is one of the 
slowest programming languages around, and there are piles and piles of Python 
code out there doing some useful thing that could be safer/faster/etc. in Nim. 
Nim just invoking/calling that code is one way to access that functionality. If it 
was too slow to begin with, it won't help. People may have misplaced 
expectations about how little they have to think in order to improve 
performance, of course.


Re: Macro to create a dictionary (table) like in python!

2019-07-10 Thread cblake
I would have to agree with `@`. +1 vote.


Re: Macro to create a dictionary (table) like in python!

2019-07-12 Thread cblake
These association lists, often abbreviated "alist", get a lot of play in lisp, 
but also OCaml, Haskell, etc.. 
[https://en.wikipedia.org/wiki/Association_list](https://en.wikipedia.org/wiki/Association_list)


Re: Hyphens Not Allowed in Nim Filenames? [Invalid Module Name]

2019-07-18 Thread cblake
I think it would be good to allow the quoted form for consistency's sake. Also, 
there should probably be a `{.out: "whatever-name".}` like the `{.passC.}` etc. 
pragmas to control the output name from within the source file. (Maybe there is 
now, but last I looked there wasn't. Maybe `outFile` or `outExe` are better 
names.)


Re: Hyphens Not Allowed in Nim Filenames? [Invalid Module Name]

2019-07-18 Thread cblake
(consistency with what is possible for identifiers in other contexts, in case 
my intent wasn't clear..)


Re: Hyphens Not Allowed in Nim Filenames? [Invalid Module Name]

2019-07-19 Thread cblake
Saying you would recommend against use without saying why you would and then 
saying that alone is a good enough reason is not much of an argument, actually.

So, why would you recommend against its usage? Kebab-case is quite popular for 
program names in the Unix world. Fully 20% of my `/usr/bin` programs have a 
`'-'` in them (about 3X the number that use underscore). (These are not 
programs named by me.) Also popular is having the name of the program match the 
name of the source file up to an extension. So, if you mean to say this 75% 
super majority of dash vs underscore program namers is so wrong-headed, what 
exactly have they got wrong?

As for "dealing with code" that uses it, use in library/module names as opposed 
to program names is unlikely to be common. That said, tossing backticks on 
there doesn't sound like much of a burden. It might be less onerous in some 
more system-wide sense than having a "library version" and a "program version" 
or symlinks or other kinds of aliasing some people might concoct to work around 
the problem.

I think having a `{.out:.}` pragma to let a program set its own (default, 
overridable on the `nim c` command-line) name is a good step, too. That might 
solve most people's problems. It would still seem lamely inflexible that module 
names are more restricted than other identifier names (procs, vars, etc.).


Re: What do you think about the programming language NIM?

2019-07-31 Thread cblake
@mratsim's code may not trigger it, but at least in `/devel` there seems to be 
a check in `semexprs.semOverloadedCallAnalyseEffects` that errors out with 
`errRecursiveDependencyIteratorX` ("recursion is not supported in iterators: 
'$1'"). The error message was even updated just a couple months ago. It is not 
hard to trigger this erroring out with trees that are "run-time switched" 
recursions vs. @mratsim's compile-time switched recursion. I for one would love 
full recursion support in iterators (and so would love to be wrong, if I am).


Re: What text editor are you using for Nim?

2019-08-07 Thread cblake
Obviously, editing time can be very different than start-up time, but it's easy 
to benchmark start-up time and it's somewhat indicative. In the early 1990s 
`emacs` with non-trivial `.emacs` configurations could take several to many 
seconds just to start up and load a file while `vim` would start-up in the 
blink of an eye (like `< 100ms`). These days, `vim` has become bloated (in 
default configs) while `emacs` has gotten faster (especially with 
`emacsclient`, and maybe even more so if this JIT compiler ever takes off). So, 


$ utime vim -c quit nada
0.012548586 0 0   0.0%
$ utime emacs -q -nw --eval='(save-buffers-kill-terminal)'
0.017100606  0.01 0  58.5%



or "both are pretty similar" (`12.5 ms` vs `17 ms`).

With `vim -u /dev/null -c quit` which is a little more fair since I had `-q` on 
the `emacs` you see more like `1.8 ms`. (`vim --startuptime` doesn't show many 
hot spots, but a lot of miscellaneous loads.) However, with `emacsclient`, its 
time goes down to a similar `2.9 ms`, only you get a fully loaded environment. 
And the "busybox vi" start up is like 190 microseconds, but boy is that a 
spartan environment - more analogous to vim of the early 90s.

All these times, though, are "too fast to notice", as is the case with most 
everyday editing operations, unless there's some kind of loop. All probably 
have "scaling holes" they could patch up with some significant re-engineering. 
I mostly use `vim` and I've seen it grind to a halt with some really long lines 
in binary files. Developer time generally matters a lot more, though that often 
relies a lot on what one is "used to".

Anyway, I thought some numbers and command-lines that readers could try on 
their own configs might add some objectivity. (Well, that `utime` thing is a 
homebrew command of mine...Sorry. Needed the resolution. 
[https://github.com/gsauthof/cgmemtime](https://github.com/gsauthof/cgmemtime) 
will do `ms` resolution anyway.)


Re: What text editor are you using for Nim?

2019-08-07 Thread cblake
While I doubt I disagree at all with @amalek's editor recommendations (brief 
aside - kakoune actually distributes a `nim.kak` though I don't know how well 
it works), it's also worth pushing back on the idea that lisp needs to be slow. 
Interpreters do tend to be, and interpreters of dynamic languages even more so, 
and I don't think vimscript executes very quickly. Compiled lisp, like Nim (or 
even Cython), can with a little "`(declare)` care" be roughly on par with C, 
depending on what one is doing. Of course, different languages have more or 
less "performance fragility" for lack of a standard term.

In terms of people having emacs-extensibility with less steep performance 
tradeoffs, there is this nascent effort 
[https://tromey.com/blog/?p=982](https://tromey.com/blog/?p=982) It would be 
nice to see that go forward someday. Taking some 500 ms pause time down to 150 
ms might make all the end-user difference necessary to really improve that 
experience.


Re: What text editor are you using for Nim?

2019-08-08 Thread cblake
I should perhaps also have mentioned that if emacs' garbage collection behavior 
was the real pause-time culprit for @amalek then that may not be any better 
with a JIT since the problem may be about memory loads more expensive than 
interpreter CPU cycles. 
[https://cse.buffalo.edu/~mhertz/bc-pldi-2005.pdf](https://cse.buffalo.edu/~mhertz/bc-pldi-2005.pdf)
 has some answers, but only for the "last level caching" hard disks/SSD data in 
RAM. It's possible deeper cache (like L3/TLB) thrashing is behind some `emacs` 
GC pauses. It's likely that much of the perceived "slowness of Lisp" is about 
such GC-(hw|OS) impedance mismatches. That's not fundamental, though. Nim's GC 
usually works fine.

In my limited experience almost no elisp packages (or users!) try to tune GC 
behavior, though maybe recently that's been different 
([http://bling.github.io/blog/2016/01/18/why-are-you-changing-gc-cons-threshold/](http://bling.github.io/blog/2016/01/18/why-are-you-changing-gc-cons-threshold/)).
 My personal guess is that optimal settings are probably _dynamic_ , related to 
competition for L3, and should not be some fixed 800kB number. Maybe 
@Kaushalmodi knows? Anyway, editor performance tuning is a bit off the main 
thread topic.


Re: How to add a symbol to represent the infinity (∞)

2019-08-08 Thread cblake
I assume you realize that your second example will not get to "Year Zero" since 
it sure looks like you are requesting an infinite number of B.C. years?

Not sure you want to build an entire lazy evaluation system and symbology like 
Perl 6. It's easy to do an infinite iterator by just writing your own iterator 
called `fromXtoInf` 
[https://nim-lang.org/docs/tut1.html#iterators](https://nim-lang.org/docs/tut1.html#iterators)
 (and you could put that unicode ∞ symbol in the iterator name if you want, but 
you might need to backquote the symbol). I'm really not sure, though, how many 
would think that is clearer than simply: 


number = 0
while true:
  number.inc
  ...


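For the curious, such an iterator is only a few lines (a sketch; `fromXtoInf` is just a made-up name and the caller still decides when to stop):


iterator fromXtoInf(start: int): int =
  var n = start
  while true:      # no upper bound; the for-loop body must break
    yield n
    inc n

for number in fromXtoInf(3):
  if number > 7: break
  echo number
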


Re: How to add a symbol to represent the infinity (∞)

2019-08-09 Thread cblake
@LeuGim - nicely done & interesting approach!

I'm not sure I would have imbued `inf(0)|Infinite(0)` with as precise semantics 
in your `for 3 .. inf(0): ...` construction. Infinity is more a process than a 
number. For `∞ - ∞ ~ inf(0)` to act like zero the two limiting processes have 
to precisely cancel out. In some abstract sense, that's much less { infinitely 
less? :-) } likely than being lopsided and the result being more like either 
`-∞` or `+∞`. Without more context there's not enough to know which (or zero). 
So, being undefined is probably a better choice.

Also, for triple bonus extra credit along these lines you could try to do the 
point at infinity in the complex plane/Riemann sphere. ;-) { 
[https://en.wikipedia.org/wiki/Riemann_sphere](https://en.wikipedia.org/wiki/Riemann_sphere)
 }


Re: Compile C file together with Nim one

2019-08-15 Thread cblake
Yet another approach is to have the `nim c`-generated C code just `#include 
"foo.c"` via `{.importc: "func", header: "foo.c".}`.


Re: What do you think about the programming language NIM?

2019-08-15 Thread cblake
I do not agree that lazy linear lists are the "main example" of recursion. They 
may be the _simplest_ example, thus allowing several easy workarounds. I 
mentioned "trees" right in my comment. Most of the (admittedly subjective) 
elegance of tree algorithms comes from the way recursive code structure mirrors 
recursive data structure. `lib/pure/collections/critbits.nim:iterator leaves` 
shows some ugliness required to workaround there, for example.

I agree it may not be a high priority or even in the top 10. I do not think it 
would be considered a "no real need" extra/extraneous fluff/"feature bloat". I 
know you didn't say that, exactly, but I felt your post risked leaving that 
impression.


Re: What do you think about the programming language NIM?

2019-08-15 Thread cblake
There are always workarounds since CPUs are little state machines. That should 
be news to no one. Recursion working makes code look much nicer, IMO. For what 
it's worth, the example considered among the more compelling by the Python guys 
back in the Python 2.3 days when they added generators was in-order binary 
search tree traversal: 


iterator inorder[T](t: Tree[T]): T =
  if t != nil:
    for x in inorder(t.left): yield x
    yield t.label
    for x in inorder(t.right): yield x



which parallels the usual recursive call structure quite well. I think it would 
be great if the above Nim actually worked. Equivalent state machines are often 
ugly, trickier to get right, and a pain point (and yeah, usually faster). But 
"expressive" has always been a main goal of Nim. :-)


Re: What do you think about the programming language NIM?

2019-08-16 Thread cblake
Oh, I had meant to include `{.closure.}`. Oops. Fixed.

A stack (of something) is fundamentally required for general non-linear/tree 
cases. TCO can only help your linear cases (partly why I call them "easy"). 
Good trees have well bounded depth, but yeah, wildly unbalanced ones can cause 
trouble for **any** impl, recursive or not. A non-recursive one may minimize 
the per-stack entry storage a bit better, though.


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread cblake
I agree with @siloamx. I would also point out that for many less 
compiler/parsing-sophisticated programmers, splitting is conceptually simple 
enough to make all the difference. Such programmers may never even have heard 
of "lexing". This is just to re-express @Araq's point about it being "naive" in 
maybe a slightly more accommodating way. Probably, they should learn, but maybe 
they want to focus on some simple problem. Formats aren't always in a 
strict/pure table format amenable to `parsecsv`.

I have some routines in `cligen/mslice` 
([https://github.com/c-blake/cligen](https://github.com/c-blake/cligen)) that 
may help for this. Some example usage is in `examples/cols.nim`. In the `mmap` 
mode of `cols`, I get ~50% the run-time of GNU `awk` for field splitting, 
though I have admittedly never tried with 15,000 columns.

One final, related point is that `gunzip` can be very slow at decompressing and 
is single-threaded. If the average column width is ~20 Bytes, 300k*15k*20 =~ 90 
GB. Working with the uncompressed file directly or using something that can 
decompress in parallel like `Zstd` 
([https://en.wikipedia.org/wiki/Zstandard](https://en.wikipedia.org/wiki/Zstandard))
 may help **a lot** , especially if you have 4-12 idle cores as many do these 
days. On Linux, you should be able to just 


let f = popen("pzstd -cdqp8 < myfile.zs".cstring, "r".cstring)



(You may have to, just once, `zcat file.gz | pzstd -f -p8 -19 > file.zs` and, 
of course, this will not help if you have a pile of `.gz` files only to be 
processed once.)


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread cblake
In fact, that VCF format does use `\` escapes 
([https://samtools.github.io/hts-specs/VCFv4.3.pdf](https://samtools.github.io/hts-specs/VCFv4.3.pdf)).
 So, basically all the above code examples are indeed wrong, and @Araq's advice 
is the right general advice (I did say they should probably learn).

That said, I also still think there are many situations both quick & dirty and 
otherwise where simple splitting makes sense and is not wrong. Why else provide 
it in `strutils`?


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-18 Thread cblake
On slightly closer inspection of that spec, it seems that backslash quoting 
only happens in the `##` comment sections ignored above. So, maybe they aren't 
buggy after all if that data is not important to the calculation in question.

A further point along the lines of "if we're going to provide `split`, maybe we 
should provide faster variants" is that the times when it is going to matter 
most will be for huge inputs where the columns may well have a more regular 
nature like numbers and specifically forbid more complex lexical structure. 
There it might A) be correct, B) performance might matter a lot in human terms, 
and C) the programmer might mostly be non-sophisticated with regard to even 
terminology like "lexing" or "vectorized memchr". This all seems to be the case 
with this VCF thing, but I feel like it's come up quite a few times over the 
years. It may not **always** be "bugs running faster". :-)

Anyway, I don't think "to force people to learn new terminology/techniques" is 
a very welcoming answer. So, I tried to provide something more welcoming. Even 
if their parsing is sloppy & error prone, I think naive programmers facing 
consequent errors on their own data sets rather than complaining about Nim 
library performance is better optics for Nim.

All that said, I think we agree 100% that we probably need more information 
from @markebbert to help him any more with his actual problem. Maybe it is IO. 
Maybe he didn't even compile with `-d:danger`. If he's on Linux, I would 
suggest him decompressing first and trying my `mmap` versions. 90 GB/(100 MB/s) 
=~ 900 seconds =~ 15 minutes. Heck, some people even have 90GB of RAM. :-)


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread cblake
FWIW, I suspect the answer to all this noise is that @markebbert was simply not 
using an optimized compile (as suggested by the very first line of the very 
first response to him).

@jyapayne - what I did was go to 
[https://vcftools.github.io/index.html](https://vcftools.github.io/index.html) 
and download the distro and find the biggest VCF file in there 
(`contrast.vcf`). Then I did something like this: 


#!/bin/zsh -f
head -n46 contrast.vcf > head.vcf
tail -n+47 contrast.vcf > tail.vcf
hardTab=$'\t'
## `cols` is from `cligen/examples`
cols -c -d "$hardTab" {1..7} < tail.vcf > tail-more.vcf
## Above has 106 distinct rows. Probably diverse enough.
paste tail.vcf $(repeat 2500 echo tail-more.vcf) > tail-wide.vcf
cat head.vcf $(repeat 30 echo tail-wide.vcf) > big.vcf



That last file is only about 1.5 GB, likely about 60x smaller than 
@markebbert's but should have otherwise similar statistics (15,000 columns) and 
fit in almost anyone's RAM. (It also compresses via Zstd to under 500 _kB_ due 
to the way it was synthesized and decompresses in under 1/4 of a second for 
me...)

Then I just ran his initial Python and Nim (dropping the gzip stuff and 
adjusting path names). I reproduced the unsurprising Nim debug-build slowness 
(11.8 seconds) with his Python running at 4.81 seconds. Then with `-d:danger` 
got the Nim running in 2.07 seconds about 2.3X faster than the Python.

Then, just for kicks, I did a version using libraries that I alluded to which 
re-uses the same two `seq` for column outputs and got the Nim to 0.984 seconds, 
almost 5X faster than his Python: 


import cligen/[mfile, mslice]

proc main() =
  # var genotype: string
  var i = 0
  var cols, subCols: seq[MSlice]
  for line in mSlices(mopen("big.vcf")):
    if line.len > 0 and line[0] == '#':
      continue
    discard line.msplit(cols, '\t', 0)
    for col in cols:
      if i >= 9:
        discard col.msplit(subCols, ':', 0)
        for fmt in subCols:
          # genotype = $fmt  # Py did not do anything here..
          break
      inc(i)
main()



What further optimizations make sense, such as eliding many splits by bounding 
columns, using iterators rather than `seq`, etc., ultimately depend upon what 
further calculation he was intending to do with the parsed data.

Scaling up to his problem size, this last version would translate to about 1 
minute run time vs his initial 400 minutes. "In real life", likely 90+% of his 
time would be spent waiting on `gunzip`. Literally any compressor allowing 
parallel decompression (`pixz`, `pzstd`, etc.) or just fast single-threaded 
decompression like `lz4` would probably be much less of a pain point for him. I 
realize, though, that he may have piles of giant data files already "trapped 
behind gunzip" in ways beyond his control. Converting them may still help him 
if he has many repeated calculations to do and the disk space.


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread cblake
Well, I don't really have a blog. So, this is what you get. ;-) Someone else 
can, though. Ideally, just give me credit by linking back to here. Or if you 
can make any of that `cligen/mslice.*split*` faster then a PR is welcome.

As a slight update, storing `big.vcf` in `/tmp` (a tmpfs aka `/dev/shm` RAM 
filesystem) and doing profile-guided optimization with gcc (via a little 
`nim-pgo vsn2 ./vsn2` script I have around), I get about 15% improvement down 
to 0.85 s. Linux `perf` tells me about 58% of that time is just in 
`__memchr_avx2` which is already hand assembly tuned.

That is probably about as fast as you can get (at least on my Skylake 
generation CPU) in any language. You might be able to eke out another 10..20% 
or more if you hand-rolled everything in assembly and did just the right 
prefetches/branch predictor gaming. Or you might not. Parsing at 1.85 GB/s 
isn't so bad. For reference, on that machine single-threaded RAM bandwidth is 
about 32 GB/s, and as mentioned Zstd can spit out that compressed file about 3x 
faster { though that is multi-threaded over 4+ cores..So, it's actually 
slightly slower on a single-core basis }.

Anyway, I doubt there is any real Nim problem here. A better use of time might 
have been to just wait for @markebbert to respond to the very first line of the 
very first response.


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread cblake
I don't know. Thanks for the implicit compliments and all, but it's an awfully 
specific set of circumstances, not a general performance analysis even of VCF 
never mind DSV parsing. The performance will be closely related to how long 
various column substrings are. That will be specific to the data I just made up 
from a sample file to have something reproducible by others. Someone with real 
data files like @markebbert would be in a better position to say something 
interesting about VCF parsing. { I also continue to agree with Araq that for 
this kind of situation it would be better if someone who's read that PDF which 
I linked to more thoroughly than I did (or want to) wrote some careful VCF 
parser API in Nim. }

So, I'm not sure what the post would be "about" exactly except perhaps some 
implicit performance one-upsmanship of 
[https://nim-lang.org/blog/2017/05/25/faster-command-line-tools-in-nim.html](https://nim-lang.org/blog/2017/05/25/faster-command-line-tools-in-nim.html).
 Such one-upsmanship is often an almost never ending game of diminishing 
returns and benchmark debates that I don't have time/inclination to pursue 
further at this moment. Also, it's just code outside the stdlib anyway. Feels a 
little like someone else's yard to me rather than some specifically 
Nim-conventional way to do something. `cols.nim` is already a small 45-line 
example program, and the above snippet even smaller. But by all means, if you 
or Kaushal want to write more about it and link back here then you should! - 
cheers


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread cblake
You're welcome.

The usual behavior for compilers is to generate slow code about as quickly as 
possible, maybe also with the best debuggability, and do things like add 
`-O`/`-O2` to get faster code generated more slowly. I don't think Nim should 
vary from that. I am not entirely sure how much speed-up comes from 
`-d:release`/`-d:danger` and just using the optimizing modes of gcc.

If people are coming from Python they are going to have to learn about 
compilation options sooner rather than later. Someone, like in this case, who 
seemingly thought to switch to an iterator but not to try higher compiler 
optimizations..Well, I don't know how to prevent that. Their mind is _almost_ 
in the right place. We can just wait until they respond to the 
did-you-do-d:danger as a first question before worrying, though. I'm only about 
90% sure that's even what happened in this case since we haven't heard back 
from @markebbert.

This `split` question also has come up a lot, eg.: 
[https://forum.nim-lang.org/t/1580](https://forum.nim-lang.org/t/1580) 
[https://forum.nim-lang.org/t/2282](https://forum.nim-lang.org/t/2282) 
[https://forum.nim-lang.org/t/3557](https://forum.nim-lang.org/t/3557) It may 
not always/often be the best approach, but it is an operation provided by 
virtually every programming environment. Even ANSI C has `strtok`. 
So, for good or ill, it's often a "first comparison". So, I think we probably 
should have re-usable `var seq[]` accepting overloads in the stdlib as well as 
the iterator in case whatever sample code needs random access to post-split 
members.


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread cblake
@jyapayne - not according to him. :-) But your two messages clearly passed in 
flight. As Araq mentioned, that `gzipfiles` module may need some work. You 
should do it! Personally, I have been avoiding gzip since at least 2007. There 
are just much better algos & tools now across every dimension except perhaps 
"nostalgia". :-) I think gzip should be retired when & where it can, but I 
understand that it remains popular.

Also, I suspect doing a `popen` implementation for Windows in Nim's stdlib and 
just making that utterance I mentioned above more portable might provide higher 
long-term value. The library approach only really adds value over external 
commands when people have many small compressed files which is probably not 
that common a case. It's very easy to separate the concerns of "which 
compressor" with external commands.

@markebbert - Consider migrating to a parallel compressor and especially 
decompressor if your colleagues/data stakeholders would approve. You may also 
want to be careful about list comprehensions in terms of memory footprint. E.g., 
for one L1-cachable row they may be fine, but for the whole O(100GB) file a bad 
idea. Oh, and Nim really can be as fast as C/C++/Fortran. You should give it 
some more time when you have a chance. In some ways it's even simpler than 
Python (e.g. UFCS).


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread cblake
Nim is much closer to one language to rule them all than most, and I think it's 
very amenable to "gradual learning". I think the more you put in the more you 
get out.

Also, my parallel-decompression-to-optimize-IO recommendations are language 
independent { they depend more on things like if you have RAM to just buffer 
the whole file and/or disk array/NVMe IO >> Zstd decompress output rates 
(~often 2 GB/s..several GB/s), etc. }. I realize that it can be a chore to 
re-encode terabytes of data, educate others why, and train them how to access. 
I expect it can take your 10 minutes to get through your data down to < 1 
minute, though (not counting re-compression time, of course). It may be another 
"surprising order of magnitude" for you, but it's pretty off topic. So, I'll 
shut up about it.

Anyway, good luck and open issues/ask questions in the forum if you have them!


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-19 Thread cblake
Just some numbers on my machine for that exact file to make earlier points more 
concrete: 


78 seconds to decompress with gzip (@jyapayne must have a fast box!)
1598.59s (u) + 16.65s (s)=415.53s to recompress with pzstd -f -p8 -19

File Size   Name
26966704829 ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf
1923680177  ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz
165270307   ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.zs

(So, the Zstd file is almost 12x smaller.)
3.68 seconds to decompress with pzstd -cdqp8



So, 78 seconds vs 3.7s =~ 21X. Could be even bigger with more free cores. More 
than an order of magnitude time savings with also more than an order of 
magnitude in space savings, once the file is re-encoded.


Re: Can the return value of a proc be a variable marked {.global.} ?

2019-08-20 Thread cblake
While I agree with @mratsim basically across the board, to be fair to the 
proposal I think it was implicit that, if the defaults switched, there would 
be an easy way to activate a debugging mode like `nim c -d:debug`. I think nim 
should be like all other compilers, though, which is how it is currently.

I think the best course of action is just a new `Hint` that power users can 
turn off that tells newbies @mratsim's "Reminder". There could even be two, 
one corresponding to `-d:release` and one to `-d:danger`.


Re: Can the return value of a proc be a variable marked {.global.} ?

2019-08-20 Thread cblake
Well, maybe open an issue or do a pull request. It seems like a 5-15 minute task 
for someone (plus probably waiting hours on the CI).


Re: Can the return value of a proc be a variable marked {.global.} ?

2019-08-20 Thread cblake
@Kaushalmodi's issue, while interesting, is fairly independent since it almost 
certainly is just about `gcc` and not, say, `clang` (and quite probably just 
some range of versions of `gcc`). And people can fiddle with their `-O` levels 
in `gcc.options.speed` in `nim.cfg/etc`. I don't think it really argues/advises 
against @mratsim's warning/hint solution.


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-20 Thread cblake
Because various Zstd ratios are all so large, it helps in many practical 
circumstances more than choice of programming language (which often plays in 
the 2x-5x range). Continuing with that one data file example, with just 4 cores 
the output rate is 7.3 GB/s. On an otherwise idle 16 core system, I get 
decompress times of 2.26 sec => output rate of 11.93 GB/s (1 TB in 83 seconds, 
1 PB in a day).

These numbers are really much faster than almost all IO systems these days. 
(Not "all", but probably "all except crazy fat RAIDs of 4 or more NVMe's").

To keep that data pipeline fed, you only need 165 MB/2.26s =~ 73 MB/s which is 
also well within reach of most IO systems. So, getting that 7-12 GB/s vs gzip's 
paltry 345 MB/s decompress rate is quite achievable and 20X..34X speed boost vs 
gzip is compelling.

Assuming that compression ratio of 163:1 holds up for other VCF files, that 
petabyte of VCF output would only take up like 6 TB disk which a lot of people 
have lying around these days. And people like markebbert are apparently patient 
enough to wait for 7 hours. So, in some sense, petabyte-scale home computing 
has been around for half a decade now "on the down low for the somewhat 
patient".

I'm a little surprised it isn't more well known/widely discussed. Not 
everything compresses like gangbusters the way this VCF data does, though.


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-20 Thread cblake
After `-d:danger -d:release` and gcc-9.2 on an i7-6700k at 4.8GHz this runs in 
about 46 seconds for me against the decompressed file in a RAM filesystem: 


import cligen/[mfile, mslice]
proc main() =
  for line in mSlices(mopen("big.vcf")):
    if line.len > 0 and line[0] == '#':
      continue
    var i = 0
    for col in line.mSlices('\t'):
      if i >= 9:
        for fmt in col.mSlices(':'):
          # do something with $fmt
          break
      i.inc
main()



Note that the i=0 should be moved down for the loop to be similar. The first 
`[9..]` Nim slice version was not quite converted correctly to the iterator 
version. That mistake propagated to @jyapayne's version. Mark may well have 
caught that already.

Also note that there was some kind of misunderstanding earlier about `maxSplit` 
helping a lot, but the code seems to want to parse all columns _except_ the 9 
early header columns.

Anyway, time beyond about 40seconds .. 1 minute of "parsing time" should just 
be IO. And that IO could be reduced to probably 4 seconds with Zstd, but may be 
more like 1.5 minutes with gzip. So, I might expect times somewhere in the 2-3 
minute range for @jyapayne's sample file.

The statistics of the two files sound pretty different. @jyapayne's example file 
has 3.5e6 \n chars and 8.7e9 \t chars while @markebbert reported 
300e3 \n chars and 4.5e9 \t (300k*15k). I might expect that @markebbert's might 
run faster, but his data columns may be larger.


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-20 Thread cblake
You're welcome.

@jyapayne - Well, there is this 


import strutils, posix
proc main() =
  for line in lines(popen("gzip -dc < big.vcf.gz".cstring, "r".cstring)):
    if line.startsWith('#'):
      continue
    var i = 0
    for col in line.split('\t'):
      if i >= 9:
        for fmt in col.split(':'):
          # do something with $fmt
          break
      i.inc
main()



which takes about 255 seconds (4m 15s) on @jyapayne's sample file on my 
machine, and then if you are willing to have a cligen dependency you can get it 
about 4x faster (67 seconds) with this: 


import strutils, posix, cligen/mslice
proc main() =
  for line in lines(popen("gzip -dc < big.vcf.gz".cstring, "r".cstring)):
    if line.startsWith('#'):
      continue
    var i = 0
    let msLine = toMSlice(line)
    for col in msLine.mSlices('\t'):
      if i >= 9:
        for fmt in col.mSlices(':'):
          # do something with $fmt
          break
      i.inc
main()



The latter needs a `popen` analogue on Windows (I bet there's one in the stdlib, but 
I always just use `popen`), and he'll have to `nimble install cligen`, but he 
may want to anyway.

Another perhaps non-obvious note about the `popen` approach is that the 
decompressor runs as a separate process, and so in parallel if you have at 
least two idle cores. Hence, run-time tends to be max(decompress time, parse 
time) rather than sum() of the two times. (Bandwidth across a unix pipe is 
usually >15GB/s anyway.) Anyway, Mark could probably run that 2nd version on 
his file. It might be a little faster or slower, but is probably within 2x 
which is still 5x faster than Groovy. Might be more than 10x faster.

To answer some of @markebbert's compression questions..It's a lot to cover, but 
I'll try to be brief. The `.zs` or `.gz` or `.bz2` or `.xz` formats are all 
different as are the related compression algorithms, but some tools support old 
formats for compatibility. You won't necessarily get any performance boost in 
any dimension (time/space/etc), though. To get good parallel speed-up of 
decompressed output, the compressor must prepare the input specially (in N 
independent file regions - so that N threads can be decompressing their streams 
independently). `pixz` can create such independent region `.xz` format files 
decompressible with parallel speed-up by itself or without parallel speed-up by 
regular `xz -d`. There is a similar `gzip` tool called `pigz` that can do 
similar for `.gz` files.

I don't know if `pzstd` can do like `pigz` for `gzip` with independent regions. 
`pigz` itself always used to focus only on compression (the slow, but 
do-it-only-once part) instead of decompression (the do-it-many-times, 
wouldn't-it-be-great-to-be-fast-in-parallel part).

Even if `pzstd` can unlike `pigz` get parallel un-gzip going, you won't get the 
giant 163:1 compression ratios, though - in fact they will be somewhat worse 
ratios than regular `gzip -9` ratios. Moving to Zstd native compression algos & 
formats is probably the best advice, but also the most work in terms of 
education, persuasion, etc. as I've mentioned before.

Of course, even my `memchr` based `mSlices` only runs about 2x faster than 
`gzip` decompresses. So max(4 seconds, 45 seconds) will be ~45s which isn't 
that much better than ~67s. So, "costs in context" and all that may apply. (I 
think 67s is better than my earlier 78s due to cache effects - decompress to 
the pipe & throw away all in L2 vs decompress to a RAM filesystem if anyone is 
keeping tabs on my numbers that closely which admittedly seems unlikely.)


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-21 Thread cblake
For what it's worth, and for completeness if Windows portability even matters 
in this case (as @markebbert mentioned, these science things are often one time 
deals), this works but is 6x slower (405 sec aka 6min 45sec) than the 
`popen`/`mSlices` variant: 


import strutils, osproc, streams, cligen/mslice
proc main() =
  let p = startProcess("gzip -dc < big.vcf.gz", options={poEvalCommand})
  let outp = p.outputStream
  var line = newStringOfCap(4096).TaintedString
  while outp.readLine(line):
    if line.startsWith('#'):
      continue
    var i = 0
    let msLine = toMSlice(line)
    for col in msLine.mSlices('\t'):
      if i >= 9:
        for fmt in col.mSlices(':'):
          # do something with $fmt
          break
      i.inc
main()



That `streams` code needs some better line-buffering love, though { Or `osproc` 
could use `File` instead of `Stream`}. `system/io.nim:readLine(File,..)` used 
to be a similarly slow almost identical implementation.

But, the clear speed winners so far are either the `mopen` variant decompressed 
if you have the space/RAM or, if you run on Unix, the 
`lines(popen())`-`mSlices` variant (re-encoded with `pzstd` if you need to 
process the same file many times). { If `nimble install` doesn't work for you, 
in a pinch, you could always `git clone https://github.com/c-blake/cligen`, 
copy `cligen/mslice.nim` into the same dir as your program and adjust the 
`import` to its unqualified name. I get that Araq doesn't want to rely upon 
libc `memchr` being fast or support different compile-time/run-time versions, 
but 4X slower is a pretty big hit. That's why I tossed `mslice` into `cligen` 
so others might benefit. I'm not even sure `mSlices` is as fast as possible and 
as I mentioned various overheads clearly depend on string/substring lengths. }


Re: Nim vs. Python & Groovy (string splitting): Why is string splitting so slow in Nim?

2019-08-27 Thread cblake
It looks like `nim-faststreams` is just to bridge the API gap between mmap IO 
and streams interfaces which is nice and all, but won't help for `startProcess` 
or other `popen`-like contexts where in this discussion streams slowness was a 
problem. { @mratsim didn't say it would, exactly, but I thought I should 
clarify (unless I'm wrong). }

More importantly, perhaps to @markebbert's surprise, it turns out that the 
bioinformatics guys are being smarter than `pigz` with this `bgzip` tool which 
supports parallel decompression. That very same example file that's been my 
running example since @jyapayne mentioned it is in fact `bgzip` d. It can 
decompress in only 11.1 seconds on the same machine with 4 threads. (I got it 
down to 3.5 seconds with over 25 threads but that same Zstd example gets down 
to 2.2s using similarly many threads). So, an `mSlices` \+ `popen("bgzip -d 
--threads 4 < 
ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz")` 
version should be able to go in roughly 1 minute for that file.

As @brentp surely knows, his `hts-nim` library actually supports a 
`set_threads` and `read_line` that can deliver that, too. It's entirely 
possible @markebbert could have been getting that much faster decompress time 
all along (with python and groovy as well via a popen analogue). Or maybe his 
`.gz` is not from `bgzip` and he couldn't. The way for him to tell is to run 
`file` on the `.gz` and see if it has "Blocked GNU Zip Format (BGZF; gzip 
compatible), block length" instead of simply "gzip compressed data".

Of course, the compression ratio will still be 12x worse than Zstd. Besides the 
obvious space waste, rather than a tiny 45-75 MB/s you would need backing IO of 
172..542 MB/s (depending on 4 or 25 threads). On the high end many thread case, 
you're looking at either a fast SSD or a RAID array of spinning disks to feed 
that analysis pipeline (assuming zero cost parsing, anyway, which I know his 
example calculation did not have, but that assumption simplifies since he 
didn't really show a full calculation).


Re: Getting Unix-Style file permissions

2019-10-03 Thread cblake
It may also depend on how "standard" the desired string is whether @jasper's 
solution is enough for @cnuzzi. Usually, the setuid, setgid, sticky bits have 
some `'s'` or `'t'` substitutions for the `'x'`, and capitalized `[ST]` if one 
of those bits is set but the corresponding execute permission is missing. I 
think Unix standardization efforts were disintegrating by the time they were 
getting around to this level of the userspace utilities.

So, @jasper may need to use something like this (from 
[https://github.com/c-blake/lc/blob/master/lc.nim](https://github.com/c-blake/lc/blob/master/lc.nim)
 search for `proc fmtPerm`): 


import posix
proc fmtPerm*(m: Mode, s=""): string =
  ## Call with ``.st_mode`` of some ``Stat`` to get rwxr-x..; ``s`` is optional
  ## separator string such as a space or a comma to enhance readability.
  let m = m.uint and 4095
  const rwx = ["---", "--x", "-w-", "-wx", "r--", "r-x", "rw-", "rwx" ]
  result = rwx[(m shr 6) and 7] & s & rwx[(m shr 3) and 7] & s & rwx[m and 7]
  let o = s.len
  if (m and 0o4000) != 0 and (m and 0o100) != 0: result[2]   = 's'  #setuid,+x
  if (m and 0o4000) != 0 and (m and 0o100) == 0: result[2]   = 'S'  #setuid,noX
  if (m and 0o2000) != 0 and (m and 0o010) != 0: result[5+o] = 's'  #setgid,+x
  if (m and 0o2000) != 0 and (m and 0o010) == 0: result[5+o] = 'S'  #setgid,noX
  if (m and 0o1000) != 0 and (m and 0o001) != 0: result[8+o] = 't'  #sticky,+x
  if (m and 0o1000) != 0 and (m and 0o001) == 0: result[8+o] = 'T'  #sticky,noX



I didn't see any direct code to do this formatting in `posix`, but you surely 
would `import posix` to get `Mode`. That code just uses the raw low 12 bits, 
but those are actually pretty standard.
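
Example usage, following on from the proc above (a sketch; it prints the permission string of the path given on the command line):


import os, posix
var st: Stat
if stat(paramStr(1).cstring, st) == 0:
  echo fmtPerm(st.st_mode)
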


Re: Getter and Setter methods in Nim

2019-10-16 Thread cblake
Should probably also be mentioned/of interest to @torarinvik that Nim has 
fancier support for setters as well. See 
[https://nim-lang.org/docs/tut2.html#object-oriented-programming-properties](https://nim-lang.org/docs/tut2.html#object-oriented-programming-properties)
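
Roughly the example from that tutorial section (a sketch):


type Socket = object
  fhost: int   # not exported; only reachable via the procs below

proc host(s: Socket): int {.inline.} = s.fhost          # getter

proc `host=`(s: var Socket, value: int) {.inline.} =    # setter
  s.fhost = value

var s: Socket
s.host = 42    # same as `host=`(s, 42)
echo s.host
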


Re: Winning the Base64 benchmarks.

2019-10-17 Thread cblake
Back when I used Cython more, I did always use Cython and just gradually type 
in things as required (or for some things do the equivalent of #include some C 
for some SSE intrinsics type codes). I agree it is not generally "popular", but 
it is a very legitimate mode of usage. Adding more and more cdef's etc. brings 
your code closer and closer to C semantically while staying in Cython syntax. 
Partly, I had better profiling tools for such code than "pure python" style.

They even have a nice `--annotate` mode supporting this usage style that spits 
out an HTML page with clickable generated code colorized by how C-like it is. 
That kind of source to "assembly" visualization tool would be nice for Nim as 
well, or even just gcc/clang going all the way to "real" assembly. It's got a 
bit more "oomph" for things spanning orders of magnitude of efficiency like 
pure Py and C, though, and the CPython API calls Cython generates may be a 
little easier to read than real assembly, much like the C code Nim generates.

Now, why good modes of usage (of almost anything) are not more popular..well, 
that's some question for the ages. People are imitative. "It's not popular 
because it's not popular." How to bootstrap something catching on is 
just..tricky.


Re: Difference between two dates

2019-10-27 Thread cblake
If you need a larger range and only care about dates, there is a pretty 
simple formula valid for all dates after 4800 BC in 
[http://www.faqs.org/faqs/calendars/faq/part2](http://www.faqs.org/faqs/calendars/faq/part2)/
 question 2.16.1
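E.g., a minimal sketch of the Gregorian-date-to-Julian-Day-Number formula along the lines of that FAQ entry (integer division throughout; subtracting two day numbers gives the difference in days):


proc jdn(year, month, day: int): int =
  let a = (14 - month) div 12
  let y = year + 4800 - a
  let m = month + 12*a - 3
  day + (153*m + 2) div 5 + 365*y + y div 4 - y div 100 + y div 400 - 32045

echo jdn(2020, 6, 24) - jdn(2019, 10, 27)   # whole days between two dates


Run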


Re: Some questions about cligen

2019-10-27 Thread cblake
At present your best bet for your first style is to have the wrapped `proc` 
receive a `seq[string]` and test it yourself. Note, though, that you 
aren't really getting much value out of any option parsing framework vs. just 
using `os.cmdLineParams()`.

While it could be more elegant/automatic, the second thing you ask about works 
fine with the current `cligen` as shown by `test/MultMultMult.nim` in the 
`cligen` repo.


Re: Get first element in Table

2019-11-07 Thread cblake
Minor correction to @mratsim \- if you want to iterate and break after the 
first then you can use _any_ of the table types, not only `OrderedTable`. E.g. 


import tables
var a = {1: "one", 2: "two"}.toTable
for key, val in a:
  echo key
  break


Run

Of course, Nim `Table` somewhat unusually allows keys to be duplicated (as in 
a C++ STL `multimap`, but with no ordering guarantees). So, you cannot be sure 
(without more work) which of the possibly several entries that first slot 
corresponds to.

You may know that the keys are unique by construction/etc., though, and want to 
pop them off in whatever "fast order" is convenient. Indeed, the `HashSet` in 
`sets` even provides a `pop` proc. E.g., 


import sets
var a = ["one", "two"].toHashSet
let x = a.pop
echo x
echo a


Run


Re: Get first element in Table

2019-11-07 Thread cblake
Of course. I suppose I should have been explicit to mention things are 
"system/data-structure ordered" not "key ordered". Any iteration has "an" 
order, though. That weird order just may or may not be what is desired or it 
might be fine.

In terms of your `sorta`/B-Tree I would say that (like most fledgling ordered 
tree implementations) it does not separate "seeking" from "updating" 
(inserting/deleting). Search trees are often even more useful when augmented. 
Augmented search trees allow multiple kinds of (useful) order. E.g., B-trees 
need merely one more number per already large node and a little extra 
maintenance of that number to support seek-by-rank as well as seek-by-key (thus 
morphing into an order statistics tree). So, a better design is one organized 
around a "path handle", e.g., `path: seq[tuple[ptr,index]]`. Seek-by-X procs 
can populate said `path`, updating procs can act upon it, and iterators can 
start from whatever kind of seek with next/prev updating some path handle. Done 
right the augmentation can be totally optional, decided at instantiation-time. 
Layering more traditional `Table`-like interfaces on top of that 
path-oriented internal design is then much easier than trying to implement 
everything directly.
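A bare-bones sketch of the kind of layout I mean (type and field names here are just made up for illustration, not from `sorta` or any existing package):


type
  Node[K, V] = ref object
    keys: seq[K]                 # ordered keys in this node
    vals: seq[V]                 # satellite data parallel to keys
    kids: seq[Node[K, V]]        # kids.len == keys.len + 1 for interior nodes
    size: int                    # optional augmentation: subtree element count
  Path[K, V] = seq[tuple[n: Node[K, V], i: int]]   # root..leaf "cursor"

# seekKey/seekRank procs would fill in a Path; insert/delete and next/prev
# iteration would all act on that same handle.


Run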


Re: Get first element in Table

2019-11-07 Thread cblake
(Sometimes this path handle I am advocating is referred to as a "cursor" in 
analogy with text-editing. I don't usually see it motivated by coding 
economy/simplicity for optionally augmented trees, as I am motivating it here. 
Anyway, these comments/analyses are probably better made as issues on your git 
repo.)


Re: Nim for Statistics

2019-11-09 Thread cblake
> we just Nim someone to write the package

Assuming you meant "need", this is the best ever slip of the tongue/keyboard 
and/or entry auto-correct. "Nim".."need"..all the same. ;-)


Re: Nim for Statistics

2019-11-09 Thread cblake
Mostly academic points to correct the record, but I feel like mratsim's linked 
"Fenwick Tree" uses up to 2x the cache memory of the trees of Fenwick 
1993,1994,1995. So, maybe not quite even state of the art for the mid-1990s 
(depending on metrics/data sizes)?

Also, maybe not really even Fenwick Trees? The Yu2015 paper referenced in the 
code _very_ questionably re-brands what Wong 1980 had called simply "binary 
trees" as the seemingly new (but not actually new) "F+Trees", while removing 
the space optimization that was Fenwick94's contribution over Wong80. All quite weird.

Don't get me wrong - that sampler is good code. Maybe not optimal, though.


Re: Nim for Statistics

2019-11-10 Thread cblake
For an alphabet size `A` (any int), an actual Fenwick Tree needs only `n = 
2^ceil(log_2(A))` array slots. If `A` is a power of two they need no extra memory 
over a simple histogram. Your `fenwicktree.newSampler` uses `2*n-1` (for that 
same `n`). And it's not some +-0/1-origin indexing threshold thing, but a legit 
2x through the whole last power of 2. So, we don't disagree on what you use, 
but on what you _could_ use.

That (admittedly confusing) paper even says "In fact, Fenwick Tree can be 
regarded as a compressed version of the F+ tree" (which itself kind of exhibits 
their terminological weirdness). It is to this "compressed" that I refer. If 
you don't want to think about it for a year that's ok. It is pretty far off 
topic. There are even further off topic things like 
[https://arxiv.org/pdf/1612.09083.pdf](https://arxiv.org/pdf/1612.09083.pdf) 
that warrant a mention for the curious.


Re: Nim for Statistics

2019-11-10 Thread cblake
Oh, and while I'm sure mratsim already knows all this, because this is a thread 
about Nim and statistics and R and other things, and most passers-by/search 
result finders probably do not know this, it bears noting that these binary 
indexed sum trees are often the fastest way to compute things like medians or 
other quantiles or other order statistics over moving data windows, at least up 
to a certain precision aka number of bins.

Such CDF operations are almost the same as sampling without replacement. In 
both cases one is maintaining a CDF and/or its inverse as data comes & goes 
(with moving data windows the data both comes and goes while with sampling it 
only goes). With Wong/Fenwick/binary/Fibonacci/whatever decomposition indexed 
sum trees, the per-datum scaling becomes about the _alphabet size_ or precision 
**not** the window size. (There is always an overall O(N data points) factor 
for the whole data set not just each window.)
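A minimal sketch (my own illustration, not from any particular package) of such a binary indexed sum tree over value bins, which is all a moving-window quantile needs:


type Fenwick = object
  t: seq[int]                         # 1-origin internal array

proc initFenwick(nBins: int): Fenwick = Fenwick(t: newSeq[int](nBins + 1))

proc add(f: var Fenwick, bin, delta: int) =   # bin in 0 ..< nBins
  var i = bin + 1
  while i < f.t.len:
    f.t[i] += delta
    i += i and -i

proc cdf(f: Fenwick, bin: int): int =         # count of data in bins 0..bin
  var i = bin + 1
  while i > 0:
    result += f.t[i]
    i -= i and -i

# Moving median: add(newBin, +1); add(oldBin, -1); then find the smallest bin
# with cdf(bin) >= (windowSize + 1) div 2, e.g. by a binary search over bins.


Run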

By comparison, the R `runmed` implemented in `Srunmed.c`, for example, has 
atrocious `O(w)` scaling in window size `w`. `Srunmed.c`, while better than the 
_most_ naive sort-every-data-window-not-even-with-a-radix-sort `O(w*log(w))` is 
very poor work. It does not even do the 
binary-search-of-an-ordered-array-with-AVX-optimized-memmove adjustments which 
would be both simpler than what `Srunmed.c` already does and a significant 
constant factor speed up as well as generalizing to arbitrary quantiles.

These index decomposition sum trees' (probably the most suggestive name) 
`w`-independence is even better unless you need too many bins. 64..256 KiBins 
is a lot of resolution and fits in L3 these days for even 8 byte counter sizes, 
though 4B or even 2B are often enough. This is obviously easier with the 2x 
space optimization I mentioned above. True full precision for index 
decomposition sum trees needs a bin for every possible number or say 2^32 bins 
for a `float32` which can get very memory intensive or require 
expensive-in-the-small sparse arrays/hashing.

At some loss of precision with numbers (as opposed to mratsim's words case), you can 
have some, e.g., exponentially spaced bin mapping from the full range to a more 
manageable number of bins. If you need full precision/exact answers then usually the 
best performer is a B-tree augmented to be an order statistics tree, which can get 
you "full precision" with `O(lg(w))` per-datum scaling. Even a ranked binary search 
tree as per KnuthV3 in the late 1960s is better than R's `Srunmed.c`, though.

`w` can easily be in the 1000s and N millions or more. So, R's current core 
impl is potentially leaving many orders of magnitude of performance on the 
table. Of course, people using R are rarely very sensitive to performance.

/end-R-impls-a-lot-but-has-longstanding-poor-impls-of-even-basic-stuff-rant


Re: Marshal and friends

2019-11-13 Thread cblake
Totally agreed, but it bears noting that Python also has a module named 
`marshal` with a `dump` and `load` that has generally been much faster/more 
space efficient than `pickle`/ `cPickle`. Basically "pickle but even less 
readable". So, one could also think of Nim `marshal` as like Python `marshal`. 
:-)


Re: Marshal and friends

2019-11-13 Thread cblake
Well, fair point and a nice link. I mostly thought identical names for roughly 
identical things and worth mention.

Since this thread is about storing to disk, I should mention the too often 
neglected option of near-zero overhead native-binary-layout file formats that 
are "`mmap` & go" with no repeated parsing or even unnecessary paging. The file 
system can basically become your memory allocator. Ascii->binary and especially 
binary->Ascii is an expensive conversion for numbers (though any cost is always 
"compared to what").

Native binary does break "portability" to e.g. machines with different 
endianness/byte-order or non-IEEE floating point formats and whatnot. In the 
modern era, little endian and IEEE FP has won to such a degree that in practice 
this may matter little, if at all. In the old days, sharing a network volume 
between e.g., a big endian Sparc and a little endian Intel box was more common.

Anyway, Nim supports this all fine (as in, e.g., my 
[https://github.com/c-blake/suggest](https://github.com/c-blake/suggest)), 
but A) that is _far_ from pedagogical code, { but hey, at least it's a fully 
worked example }, and B) it can often be error prone work in general. Someone 
out in Nim land should maybe do a macro package someday to make it easier 
(unless they already have?).
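For a tiny flavor of what "`mmap` & go" can look like with just the stdlib (a sketch assuming native byte order and a throwaway file name, nothing like the full machinery in that repo):


import memfiles

let path = "data.f64"                       # hypothetical scratch file
block writeOnce:                            # write native float64s once...
  var f = open(path, fmWrite)
  var xs = @[1.0, 2.0, 3.0]
  discard f.writeBuffer(xs[0].addr, xs.len * 8)
  f.close

var mf = memfiles.open(path)                # ...then map & use; no parsing
let p = cast[ptr UncheckedArray[float64]](mf.mem)
echo p[mf.size div 8 - 1]                   # prints 3.0
mf.close


Run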


``Table.take`` should be ``Table.pop`` -- discuss

2019-11-13 Thread cblake
Before we leave things as just `take` in `Table`-land for very long/a full 
release cycle and having to worry about backward compatibility, I thought this 
warranted a much wider discussion than a comment thread in a closed pull 
request:

> [https://github.com/nim-lang/Nim/pull/12600](https://github.com/nim-lang/Nim/pull/12600)

Also relevant side question should `pop` be added to `apis.rst`? My opinions 
are in the link, but others should chime in.

This might also be of broader interest to relate to "rapid stdlib retractions 
in the Nim 1.0 era" if we did want to just rename rather than deprecate the 
brand new `take` overloads.


Re: ``Table.take`` should be ``Table.pop`` -- discuss

2019-11-14 Thread cblake
@mratsim \- it's not just you. Formatting looks pretty garbled.

@araq \- sensible to me to duplicate/alias rather than deprecate. Pretty 
low-cost.


Re: the "type" of the curly-bracket structure

2019-11-17 Thread cblake
What Nim calls the Table Constructor, as dom96 has correctly pointed out, the 
Lisp world has long called "association lists", often abbreviated "alist".

Since it is a macro you are writing, you could check that the kind of the 
NimNode is `nnkTableConstr` and if not emit a useful compile-time message for 
client code, as in something approximately like: 


import macros
macro `@@@`(x: untyped): untyped =
  if x.kind != nnkTableConstr:
    error "triple-@ expects a {(,),(,)..} table constructor"
  result = newLit(x.repr)   # demo only: hand back the source text; a real macro
                            # would build whatever it wants from the alist here
var a = @@@{"fromStruct": "blah", "sub": {"joe": 10}}
#var b = @@@[1,2,3]  #Errors out with above message.


Run


Re: ``Table.take`` should be ``Table.pop`` -- discuss

2019-11-17 Thread cblake
Ok. Just to close the loop on this thread I believe this accommodates all 
expressed views: 
[https://github.com/nim-lang/Nim/pull/12678](https://github.com/nim-lang/Nim/pull/12678)
 (except maybe the "one week pre-release implies backward compatibility forever" 
view or the "if one table has a proc, all tables should have it" view, but 
people didn't seem to comment on/care much about that aspect of this question).


Re: parseopt with negative numbers as positional arguments

2019-11-26 Thread cblake
It might not be quite as CL-user transparent as what you're looking for, but 
having the CL user quote a space in front of the number usually works as in: 


tempconvert \ -10 F --to=C


Run

or 


tempconvert " -10" F --to=C


Run

You might need to strip the whitespace off pre-parsing depending on what 
integer parser you are using.
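E.g., a tiny sketch of that stripping step:


import strutils
let arg = " -10"             # the argument as received from the quoted CL
echo parseInt(arg.strip())   # -10; strip first since some parsers reject spaces


Run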


Re: ``Table.take`` should be ``Table.pop`` -- discuss

2019-12-01 Thread cblake
An un-keyed `pop(Table): tuple[K,V]` more similar to `HashSet.pop` is mentioned 
in the PR commentary linked above. (I would personally just use the 
already-defined iterator rather than b3liever's inline expansion, but eh. Either way.) 
Totally suitable for another PR. I just wanted to keep the above one more 
focused.


Re: Best way to store/search/etc and an int-int data structure

2019-12-02 Thread cblake
For the curious, that sparse-dense array construction of mratsim's impl is 
useful when keys are sparse-subsets of a small universe and hails from Preston 
Briggs' Rice PhD thesis in 1992, but has this more self-contained follow-up 
paper:

[http://dcs.gla.ac.uk/~pat/ads2/papers/p59-briggs%5b1%5d.pdf](http://dcs.gla.ac.uk/~pat/ads2/papers/p59-briggs%5b1%5d.pdf)

If you need this to be an "associative array" with "satellite data" as values 
for given keys, that is a straightforward extension where the dense array can 
be an array of pairs (or you could have a parallel dense value array).
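A compact sketch of that sparse-dense construction with the parallel value seq (the universe size and all names here are just for illustration, not from mratsim's impl):


const U = 1024                # key universe; keys are 0 ..< U
type SparseMap = object
  sparse: array[U, int]       # sparse[key] = index into dense (valid only if checked)
  dense:  seq[int]            # dense[i] = key; len = number of live keys
  vals:   seq[int]            # vals[i]  = satellite value for dense[i]

proc contains(m: SparseMap, key: int): bool =
  let i = m.sparse[key]
  i < m.dense.len and m.dense[i] == key

proc `[]=`(m: var SparseMap, key, val: int) =
  if key in m:
    m.vals[m.sparse[key]] = val
  else:
    m.sparse[key] = m.dense.len
    m.dense.add key
    m.vals.add val

proc `[]`(m: SparseMap, key: int): int = m.vals[m.sparse[key]]

var m: SparseMap
m[42] = 7
echo 42 in m, " ", m[42]      # -> true 7; clearing is O(live) via dense.setLen(0)


Run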

At some slightly different point in the design space is Varghese's Aggregate 
Bit Vectors which do not allow associative extension, 
[https://cseweb.ucsd.edu/~varghese/PAPERS/icnp2002.pdf](https://cseweb.ucsd.edu/~varghese/PAPERS/icnp2002.pdf)
 , which is also pretty easy to implement.

And, of course, `Table[int,int]` is not actually that slow either.


Re: Practical examples showing how macros lead to better code?

2019-12-12 Thread cblake
In addition to Andrea's very good examples, there is also 
[https://github.com/c-blake/cligen](https://github.com/c-blake/cligen) which 
can generate a command parser-dispatcher from `proc` signatures or 
from-command-line initters from types. Everyone knows about CLI parsers. I 
think it's a real time saver. The macro code itself is pretty hairy, though.

Much macro code in Nim is pretty involved to handle many cases. If you are 
looking to just get a fully worked out introduction, I think that `memo` code 
is the best place to start.


Re: Proposal to start a Nim-Scientific Community

2019-12-13 Thread cblake
In Nim `float` should be the same as `float64`. `float32` is smaller which 
means less memory bandwidth needed as well as wider vector instructions in 
modern CPUs (e.g., on AVX you can fit 8 float32 in one register but only 4 
float64). Some neural net people want to use a couple variants of 16-bit (or 
even 8-bit) floating point formats, while for certain very high precision cases 
128-bit is becoming less rare.

So, I don't think there is some perfect choice of size/accuracy and so your 
`SomeFloat` idea sounds smarter to me. You might also benefit from having 
"iterative" numerical routines accept some kind of "precision" parameter that 
defaults to something (probably dependent upon the concrete floating point type 
in play), but allows users to maybe trade off speed & accuracy themselves. 
I.e., if you have some series approximation with an error bound then allow 
users to pass the max tolerable error (either in absolute terms or relative 
terms). Nim's named parameter passing with defaults should make it easy to have 
a common convention in your library.
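A hedged sketch of what I mean by that convention (nothing from any real library; just Newton's method standing in for some iterative numerical routine):


proc mySqrt*[F: SomeFloat](x: F, relTol: F = F(1e-7)): F =
  ## Newton's method; stops once the relative change drops below `relTol`.
  if x <= F(0): return F(0)
  result = if x > F(1): x else: F(1)
  while true:
    let next = F(0.5) * (result + x / result)
    if abs(next - result) <= relTol * abs(next): return next
    result = next

echo mySqrt(2.0)                  # float64, default tolerance
echo mySqrt(2.0'f32, 1e-3'f32)    # float32, looser tolerance traded for speed


Run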


Re: Walking trees without recursive iterators

2019-12-18 Thread cblake
In the interests of a more self-contained thread and to perhaps more easily see 
what it's doing, here is a more brief version of @sschwarzer's algorithm: 


import xmltree

type Handle* = tuple[kidIx: int; par, kid: XmlNode]   #Ix of kid in parent

iterator handlesHiToLo*(node: XmlNode): Handle =
  ## Recursively yield ``Handle`` objects for direct&indirect kids of ``node``
  ## from high to low index order (on same level). { This order simplifies
  ## removing kids in-place without affecting later yielded kid indices. }
  ## The root node is included in yielded ``Handle``s with a ``nil`` parent.
  
  template recordYield(par: XmlNode, kidIx: int, kid: XmlNode=nil) =
    handle = (kidIx, par, (if kid.isNil: par[kidIx] else: kid))
    yield handle
  
  proc high(node: XmlNode): int = node.len - 1
  
  var done = false                      #flag to stop outer loop from inner
  var stack = newSeq[Handle]()
  var handle: Handle                    #Last node yielded
  
  recordYield nil.XmlNode, 0, node      #Always yield the root first
  if node.len == 0: done = true         #Nothing left to do
  else: recordYield node, node.high     #Start from high node of root & proceed
  while not done:
    let kidHasKids =
      if handle.par.len - 1 < handle.kidIx: false   #kid@kidIx was deld post yield
      else: handle.kid.len > 0
    if kidHasKids:                      #Descend tree, build stack, yield last grandkid
      stack.add handle
      recordYield handle.kid, handle.kid.high
      continue
    while true:                         #Yield next (lower index) kid | maybe pop,find new
      if handle.kidIx >= 1:
        recordYield handle.par, handle.kidIx - 1
        break
      if stack.len > 0:
        handle = stack.pop()
      else:                             #No more nodes to try on the stack. Done.
        done = true
        break


Run

If you want some other visitation order such as low-to-high then you have to 
adjust things a bit. For in-order/work in the middle you might need to adjust 
it a lot.

I actually kind of doubt it can be simplified if you want to preserve the 
deleteability of nodes pointed to by yielded handles.

In these sorts of situations, recursive code is often drastically more clear. A 
good programming language should facilitate clear code. Therefore Nim iterators 
should grow recursion capability someday. But I'm not signing up to implement 
it, and I wouldn't assert it's a high priority.


Re: Arrays and Sequences in nim

2019-12-18 Thread cblake
As a point of clarification, depending a bit on OS and sysadmin settings, one 
can usually increase stack limits with `ulimit -s unlimited` builtin for most 
shells, maybe `/etc/limits` or `/etc/security/limits.conf` on Linux or even 
_adjustable in-program_ by `setrlimit` (if not blocked by admin settings). 
(Nim's stdlib could use an `RLIMIT_STACK` constant to ease that last one.)
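A hedged Nim sketch of that in-program route (posix's `getrlimit`/`setrlimit` are wrapped, but as noted there is no `RLIMIT_STACK` constant yet, so this defines one with the Linux value; other OSes may differ):


import posix
const RLIMIT_STACK = 3.cint            # Linux value; assumption for other OSes
var rl: RLimit
if getrlimit(RLIMIT_STACK, rl) == 0:
  rl.rlim_cur = rl.rlim_max            # raise soft limit up to the hard limit
  if setrlimit(RLIMIT_STACK, rl) != 0:
    stderr.write "setrlimit failed; admins may have capped the hard limit\n"


Run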

The default limits are usually much less than say system memory divided by 
number of expected threads. It would be very rare that all threads used their 
limit anyway. When the limit is exceeded your thread/process just crashes, much 
as if you set the limit to be unlimited and used all system memory.

The existence of low stack limit defaults historically mostly comes from 
systems with "spinning rust" rather than NVMe backing store for swap. On 
such systems, an unbounded recursion could bring the machine to a crawl for all 
users. So, the limit was a safety feature. These days, no swap at all is a real 
possibility. Even when swapping is enabled, the limit is more about not having 
some unbounded recursion flush filesystem caches forcing re-reads. Also, kernel 
code often has small stack limits for related but distinct reasons 
(non-swappability of kernel stack/page fault handlers/etc.). Kernel developers 
tend to set default limits based on what they think is "good for you" based on 
their own very stack-conservative styles.

Anyway, the short of it is that if stack is how you would prefer to manage your 
memory (for whatever reasons) there isn't anything intrinsically wrong with 
that related to "size". There's just a little deployment complexity, like maybe 
erroring out if admins have blocked stack limit increasing and your `setrlimit` 
fails which is atypical in my experience. (This is all Unix advice. I have 
little to no experience for how this plays out on Windows, but some more 
informed person can perhaps chime in.)


Re: nim android tutorial

2019-12-18 Thread cblake
Also, it's not an "app", but it bears re-iterating that Nim on Termux works 
just like Nim on a workstation/laptop/other posix-y environment. See, e.g., 
[https://forum.nim-lang.org/t/2891](https://forum.nim-lang.org/t/2891). I have 
a couple dozen Unix commands that compile and run just fine in both 
environments.


Re: Walking trees without recursive iterators

2019-12-18 Thread cblake
That sounds like a bug/limitation in @sschwarzer's algorithm to me, but maybe 
he can verify. Good catch, @e.

Another pretty different approach that I use in `cligen/tern.nim` (that does 
not support in-iteration deletion) is something like this: 


const NUL* = '\0'
type
  NodeOb*[T] {.acyclic.} = object
    ch*: char
    cnt*: int
    when T isnot void: val*: T
    kid*: array[3, ref NodeOb[T]]   #0,1,2 ~ <,=,>
  Node*[T] = ref NodeOb[T]
  Tern*[T] = object   ## A Tern can be used as either a mapping from strings
    root*: Node[T]    ## to type ``T`` or as a set(strings) if ``T`` is void.
    count: int        ## Number of elements
    depth: int        ## Depth of Tree

iterator leaves[T](r: Node[T], depth: int, pfx=""): tuple[k: string, n: Node[T]] =
  type                                               #Nim iterators should grow
    Which = enum st0, st1, st2, stR                  #..recursion capability so
    State = tuple[st: Which, k: string, n: Node[T]]  #..this headache can be as
  if r != nil:                                       #..easy as `print`.
    var stack = newSeqOfCap[State](depth)
    stack.add (st: st0, k: pfx, n: r)
    while stack.len > 0:
      let state = stack[^1].addr
      if state.n == nil: break
      case state.st
      of st0:
        state.st = st1
        if state.n.kid[0] != nil:
          stack.add (st: st0, k: state.k, n: state.n.kid[0])
      of st1:
        state.st = st2
        if state.n.ch == NUL:
          yield (k: state.k, n: state.n)
        elif state.n.kid[1] != nil:
          stack.add (st: st0, k: state.k & $state.n.ch, n: state.n.kid[1])
      of st2:
        state.st = stR
        if state.n.kid[2] != nil:
          stack.add (st: st0, k: state.k, n: state.n.kid[2])
      of stR: discard stack.pop


Run

The above is probably pretty close to what compilers generate for recursive 
function calls. Of course, as mentioned I think things like this are more 
elegant/easy to understand: 


proc print*[T](p: Node[T], depth=0) = #2,1,0 gives std emoji-orientation
  if p == nil: return #..i.e. ":-)" head-tilt not "(-:".
  print(p.kid[2], depth + 1)
  echo strutils.repeat("  ", depth), cast[int](p), " ch: ", p.ch.repr, " cnt: ", p.cnt
  print(p.kid[1], depth + 1)
  print(p.kid[0], depth + 1)


Run

For `cligen/trie.nim` since I was already using a super wasteful uncompressed 
trie, I just punted on all that and spent memory proportional to the answer, 
i.e.: 


proc collect*[T](n: Node[T], key: var string, d=0, pfx="", i=0): seq[tuple[k: string, n: Node[T]]] =
  if n == nil:
    return
  if n.term:
    result.add (k: (if i > 0: pfx[0..

Re: Walking trees without recursive iterators

2019-12-23 Thread cblake
> What do you think?

I agree that at least a few others would appreciate some fully worked out 
general tree iterator example and I encourage you to undertake the challenge.

A possibly helpful side-comment is that Python didn't really have recursive 
generators for the whole 1990s, but had a similarly nicely terse `for`-loop 
syntax for iteration. Then it grew `for x in sameItr: yield x` capability and 
finally `yield from`. So, there may be example code from back then to work off 
of.

Yet another possibility besides the several I mentioned and Araq's already 
mentioned visitor with a callback idea, would be to define an object holding 
all the state you need with a `next` method. Then the iterator could just 
instantiate one of those and call the next method until done. It would also be 
nice to be able to iterate forwards as well as backwards.
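A bare-bones sketch of that "state object with a next method" idea for a toy binary tree (all names invented for illustration):


type
  Tree = ref object
    val: int
    left, right: Tree
  Walker = object
    stack: seq[Tree]           # all the state needed to resume the walk

proc initWalker(root: Tree): Walker =
  var n = root
  while n != nil: result.stack.add n; n = n.left

proc next(w: var Walker): (bool, int) =
  if w.stack.len == 0: return (false, 0)
  let n = w.stack.pop
  var m = n.right
  while m != nil: w.stack.add m; m = m.left
  (true, n.val)

iterator items(root: Tree): int =      # in-order traversal via the Walker
  var w = initWalker(root)
  while true:
    let (ok, v) = w.next
    if not ok: break
    yield v


Run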

Another side-comment pertaining to @rayman22201's "parent node" idea is that 
while it can make tree navigation easier, for multi-way trees similar to your 
XML case, this can be undesirable. The reason is that when the parent changes 
many nodes need to have their parent field updated. While one frequently sees 
red-black or AVL trees with a parent node, one almost never sees B-trees with a 
parent node due to that update cost. Of course, there's also the storage cost 
(N-nodes with parents instead of a depth-proportional stack).


Re: A path commonPrefix function

2020-01-04 Thread cblake
> Is there a nim equivalent to Python's timeit module for timing snippets?

There are probably about 100 variants in people's local code, but I happened to 
just see your message and also to just add this to `cligen/osUt` the other day 
(needs `requires "cligen#head"` in your `.nimble` file or you could just copy 
this tiny impl and adapt as you like): 


import strutils, times

template timeIt*(label: string, unit: float, places=3, sep="\n", body: untyped) =
  let t0 = epochTime()
  body
  let dt = epochTime() - t0
  stdout.write label, " ", formatFloat(dt * unit, ffDecimal, places), sep

timeIt("write1", 1e6, 3, " ") : stderr.write "hi"
timeIt("write2", 1e6, 3, "\n"): stderr.write "there"


Run

You may want to add loops to repeat and use the `stats.RunningStat` stdlib API 
to do something nice more like Python's version. There is also this package 
called golden 
([https://github.com/disruptek/golden](https://github.com/disruptek/golden)) to 
try to iterate until timing noise has been suppressed. And probably other 
various solutions, too. (And sometimes noise **is** the data and you want 
cold-cache/system competition effects included...benchmarking is obviously a 
larger topic than a snippet thing.)


Re: A path commonPrefix function

2020-01-04 Thread cblake
Are you aware of this binary search approach? 
[https://www.tutorialcup.com/string/longest-common-prefix-using-biary-search.htm](https://www.tutorialcup.com/string/longest-common-prefix-using-biary-search.htm)


Re: A path commonPrefix function

2020-01-04 Thread cblake
Eg., in Nim it might look like this: 


proc minLen(strs: openArray[string]): int =
  result = int.high
  for s in strs: result = min(result, s.len)

proc check(strs: openArray[string]; j0, j1: int): bool =
  let srcSlc = strs[0][j0..j1]
  for s in strs:
    if s[j0..j1] != srcSlc: return false
  return true

proc longestCommonPfx*(strs: openArray[string]): string=
  if strs.len < 1: return ""
  let src = strs[0]
  if strs.len < 2: return src
  var lo = 0  # Binary search
  var hi = strs.minLen - 1
  while lo <= hi:
    let mid = lo + (hi - lo) div 2
    if check(strs, lo, mid):      # All strs match this
      result.add src[lo .. mid]   # Append to answer
      lo = mid + 1                # Go higher
    else:                         # Go lower
      hi = mid - 1

when isMainModule:
  import os
  echo longestCommonPfx(commandLineParams())


Run

There are also a lot of variants at 
[http://rosettacode.org/wiki/Longest_common_prefix](http://rosettacode.org/wiki/Longest_common_prefix)

For paths you would just need to add something to backup to the most recent 
directory separator, if any.


Re: A path commonPrefix function

2020-01-04 Thread cblake
`result += mid + 1 - lo` is, indeed, way cheaper than `result.add`, but that 
binary search approach only does the `result.add` about `ceil(log_2(minLen))` 
times (i.e., like 3-10 times probably). I couldn't even measure the difference 
on my 3078 entry average total length 17 `/usr/bin/\*` example (i.e. noise was 
>> difference).

Regardless, I **100% agree** that separating into a prefixLen and then getting 
that slice would be a nicer factoring anyway. The caller may only need to test 
the length against `0` or something.

Another algorithm of interest in terms of this whole thread is the "just 
compare first & last in sorted order" approach. This is very "tidy" from both a 
conceptual and coding point of view: 


proc range*[T](xs: openArray[T]): tuple[min, max: T] =
  if xs.len < 1: return   # default vals for T
  result.min = xs[0]
  result.max = xs[0]
  for x in xs:
    result.min = min(result.min, x)
    result.max = max(result.max, x)

proc longestCommonPrefixLen*(strs: openArray[string]): int =
  let (min, max) = strs.range
  for i, c in min:
    if c != max[i]: return i
  return min.len


Run

As written, that's about 6x slower than the binary search approach for L2 
resident data. It's probably possible to really speed up `range` for 
`openArray[string]` with some shallow copies in the loop and a real copy at the 
end. That might be a fun exercise for someone.

Indeed, for very large inputs, bandwidth starvation of modern CPUs will mean 
that doing only one pass over said inputs is going to be best, even if it does 
quite a bit more CPU ALU-type work. So, even if that `range` cannot be sped up 
to match/beat the binary search in this context, one would still want to switch 
to this (or some other) one pass algo to accommodate giant inputs in a more 
general setting..probably past the L2 or L3 boundary (assuming it can run with 
minimal cache competition).

@marks did not really discuss the size of his inputs or his use case, though 
his second "faster on his test set" is notably a multi-pass algorithm.


Re: A path commonPrefix function

2020-01-05 Thread cblake
> for s in paths[1 .. ^1]:

The above line really slows things down. Pretty sure it's creating a copy of 
that slice every outer loop. Change it to `for j in 1 ..< paths.len:` and use 
`s[j][i]` below.

I still timed your version as about 1.5x the run time of my version. That 
residual was attributable to your `foldl` vs my `minLen`. It might be simpler 
to just use my version completely when adjudicating it. :-)

Also, if you can, you should measure/say something about your expected total 
number of strings, average length (total data), and typical answer length.

A less embarrassingly slow version of `range` for "expensive `T`" than my prior 
post is: 


proc range*[T](xs: openArray[T]): tuple[min, max: T] =
  if xs.len < 1: return   # default vals for T
  var iMin, iMax: int # zero init
  for i in 1 ..< xs.len:
    if xs[i] < xs[iMin]: iMin = i
    if xs[i] > xs[iMax]: iMax = i
  result.min = xs[iMin]
  result.max = xs[iMax]


Run

With this new version, the one-pass version is only about 1.5x the run-time of 
the binary search version on that easily L2 cachable 3078 string avg len 17 
bytes with prefix 9 running example of mine.

Of course, as alluded to, the ratio of main DIMM bandwidth to L[12] cache 
bandwidth is almost always much greater than a small factor like 1.5..More like 
4-5x. So, at some point the one pass range/string compare method will be 
fastest, but that point may not matter for your use case.


Re: A path commonPrefix function

2020-01-05 Thread cblake
And before someone corrects me, an even less embarrassing variant might be: 


proc rangeAt*[T](xs: openArray[T]): tuple[minAt, maxAt: int] =
  if xs.len < 1: return (-1, -1)  # raise?
  for i in 1 ..< xs.len:  # note tuple ints init to zero
    if xs[i] < xs[result.minAt]: result.minAt = i
    if xs[i] > xs[result.maxAt]: result.maxAt = i


Run

with then `let (minAt, maxAt) = strs.rangeAt` and tracking `(min|max)` -> 
`strs[(min|max)At]` changes in the string comparison for the prefix length 
algo. Some paired-up value oriented interface for those who want it would also 
make sense: 


proc range*[T](xs: openArray[T]): tuple[min, max: T] {.inline.} =
  if xs.len < 1: return   # raise?
  let (minAt, maxAt) = xs.rangeAt
  result.min = xs[minAt]
  result.max = xs[maxAt]


Run

in which case maybe no change to the prefix len algo is needed.

While off the topic of the prefix calculation, it perhaps bears noting that 
SSE/AVX calculations can sometimes speed up the value-oriented versions of 
these range computations over arrays of CPU supported types like 
`openArray[float32]` by large factors - 24x even. Basically branchless and 
batching vector min/max instructions can be leveraged, but I am unaware of CPUs 
supporting "minAt" type vector instructions. Recent versions of gcc can 
recognize these min/max type value sweeps and correctly generate optimized 
code, but such pattern recognition can often fail as the code grows just a 
little more complex.


Re: A path commonPrefix function

2020-01-06 Thread cblake
It's a bit tricky to explain why your `foldl` would be much faster than my 
`minLen` (note "min" not "max"). Perhaps you aren't compiling with max 
optimization/`-d:danger` or something? Or perhaps some peculiarity related to 
your test data. Total data size, number of strings, and size of the found 
prefices are more salient statistics than how many times you repeat, though 
another possible explanation is that 100k repeats in a tight loop does 
something weird with a warmed up CPU branch predictor. My timings come from a 
Unix shell loop doing 10..100 repeated runs of the whole program. As with much 
benchmarking, it's debatable what is most representative.

I'll include the whole program below for reference. I called @marks' algorithm 
`lcpVertical` for vertically-oriented scanning and swapped `i` & `j` to have 
more traditional `j=column index` notation and slightly earlier exit. Because 
cligen allows unique prefixes `./lcp -alcpb /usr/bin/\*` is a valid way to run 
things in binary search mode. I don't do the final `rfind(sep)` work in any of 
those, but it's the same work for all the algorithms anyway and just a couple 
line wrapper `proc`.

Anyway, I've run with a variety of inputs and things track my stated 
expectations perfectly (compiled with devel version of nim, d:danger, gcc-9.2, 
on Linux, i6700k). For really large inputs the range algo works best..for 
really small the binary search. So, an optimal at all scales would probably 
switch from binary search to range at some machine/system-load dependent 
threshold.

stdlib-wise, the `range` LCP algo is never worse than about 2X the run time on 
my various /usr/xyz/* type tests, though. So, from a non-tuning/performance 
robustness point of view, it would be a better candidate for stdlib inclusion. 
Beyond the "one stop shopping" algo-wise and almost trivial implementation, its 
`rangeAt` & `range` helpers are much more basic/common operations than longest 
common prefix. Beyond that basicness, as mentioned the value-based `range` may 
afford strong-speed-ups for specialized types. So, it might be helpful to have 
a single point of reference that could be optimized/assumed optimized the way 
certain standard C library functions are, such as `memchr`.


when defined(foldl):
  import sequtils # For me this is slower
  proc minLen(strs: openArray[string]): int {.inline.} =
    foldl(strs.mapIt(len(it)), min(a, b))
else:
  proc minLen(strs: openArray[string]): int {.inline.} =
    result = int.high
    for s in strs: result = min(result, s.len)

proc check(strs: openArray[string]; j0, j1: int): bool =
  let first = strs[0]
  for s in strs:
    for j in j0 .. j1:
      if s[j] != first[j]: return false
  return true

proc lcpLenBinSearch(strs: openArray[string]): int =
  if strs.len == 0: return 0  # maybe -1 instead?
  if strs.len == 1: return strs[0].len
  var lo = 0  # Binary search
  var hi = strs.minLen - 1
  while lo <= hi:
    let mid = lo + (hi - lo) div 2
    if check(strs, lo, mid):      # All strs match this
      result += mid + 1 - lo      # Append to answer
      lo = mid + 1                # Go higher
    else:                         # Go lower
      hi = mid - 1

proc rangeAt*[T](xs: openArray[T]): tuple[minAt, maxAt: int] =
  if xs.len == 0: return (-1, -1) # raise?
  for i in 1 ..< xs.len:
    if xs[i] < xs[result.minAt]: result.minAt = i
    if xs[i] > xs[result.maxAt]: result.maxAt = i

proc range*[T](xs: openArray[T]): tuple[min, max: T] =
  if xs.len == 0: return  # raise?
  let (minAt, maxAt) = xs.rangeAt
  result.min = xs[minAt]
  result.max = xs[maxAt]

proc lcpLenRange(strs: openArray[string]): int =
  if strs.len == 0: return -1 # raise?
  let (minAt, maxAt) = strs.rangeAt
  for i, c in strs[minAt]:
    if c != strs[maxAt][i]: return i
  return strs[minAt].len

proc lcpLenVertical(strs: openArray[string]): int =
  if strs.len == 0: return 0  # simplified marks algo
  let first = strs[0]
  for j in 0 ..< first.len:
    for i in 1 ..< strs.len:
      if j >= strs[i].len or first[j] != strs[i][j]:
        return j
  return first.len

type lcpAlgo* = enum lcpBinSearch, lcpRange, lcpVertical

proc lcpLen*(strs: openArray[string], algo=lcpVertical): int =
  case algo
  of lcpBinSearch: return strs.lcpLenBinSearch
  of lcpRange: return strs.lcpLenRange
  of lcpVertical:  return strs.lcpLenVertical

proc lcp*(strs: openArray[string], algo=lcpBinSearch): string =
  if strs.len < 1: return ""
  result = strs[0][0 ..< strs.lcpLen(algo)]

when isMainModule:  # asserts are the Rosetta Code tests
  assert lcp(["interspecies", "interstellar", "interstate"]) == "inters"
  assert lcp(["thr

Re: A path commonPrefix function

2020-01-06 Thread cblake
Oops. Didn't survive some edits. Good catch as always, @e. Will fix in the 
prior post with a note. Does not affect my results, which are like this:


$ echo /usr/bin/*|wc
  13077   58306
$ (repeat 50; lcp -alcpV /usr/bin/*)|mnsd
9 in 53.97 +- 0.30 microseconds via lcpVertical
$ (repeat 50; lcp -alcpR /usr/bin/*)|mnsd
9 in 51.33 +- 0.42 microseconds via lcpRange
$ (repeat 50; lcp -alcpB /usr/bin/*)|mnsd
9 in 39.29 +- 0.78 microseconds via lcpBinSearch


Run

(I use Zsh which has `repeat` and a little mean-sdev computer.) Then for a 
larger inputs: 


$ echo /usr/lib/python3.6/**|wc
  1   39036 3108640
$ (repeat 10 lcp -alcpV /usr/lib/python3.6/**)|mnsd
19 in 2388. +- 14. microseconds via lcpVertical
$ (repeat 10 lcp -alcpR /usr/lib/python3.6/**)|mnsd
19 in 1131.1 +- 1.5 microseconds via lcpRange
$ (repeat 10 lcp -alcpB /usr/lib/python3.6/**)|mnsd
19 in 1523. +- 17. microseconds via lcpBinSearch


Run

The L3 on an i7-6700k is 8MB. So, this is showing the range method pulling 
ahead while still in L3 (3.1 MB). At one pass, it really has to be 
approximately the fastest at large scale. Reading from DIMMs the difference 
should be even more pronounced in favor of range.

It's possible that that marks/vertical scan style might be faster if the answer 
is very small. E.g., for a zero common prefix length, the answer will be 
concluded after just one pass down the first column via `lcpVertical`.


Re: Walking trees without recursive iterators

2020-01-06 Thread cblake
Well, this really does not address @spip's question at all, but it might be 
something @sschwarzer could use as an example for a multi-way/k-ary tree 
iterator. The iteration itself (not via `iterator`, but via `Path`) also should 
support deletion/insertion/other tree surgery. Maybe this time @e won't find 
any bugs. Famous last words. ;-) 


type  #k-ary search tree (like a B-tree)
  NodeOb {.acyclic.} = object
    data*: seq[int]         #.data.len > 0
    kids*: seq[ref NodeOb]  #.kids.len == .data.len + 1
  Node* = ref NodeOb#.. Also, .kids.len==0=>LEAF
  Path* = seq[tuple[p: Node, j: int]]

proc newNode*(data: openArray[int]): Node =
  result = Node.new #Convenience Node builder
  result.data.setLen(data.len)
  for i, x in data: result.data[i] = x

proc seek_most*(path: var Path, t: Node, lower=true) =
  if t == nil: return#Extend path from t to least|..
  var t = t  #..|greatest-most kid per lower
  while t.kids.len > 0:
    path.add (t, if lower: 0 else: t.kids.len - 1)
    t = t.kids[path[^1].j]
  path.add (t, if lower: 0 else: t.data.len - 1)

proc seek_most*(t: Node, lower=true): Path =
  result.seek_most(t, lower)  #Conveniene Path builder

proc seek_succ*(path: var Path) = # path -> successor
  if path.len == 0: return#Just for safety
  if path[^1].p.kids.len > 0:
    path[^1].j.inc
    path.seek_most(path[^1].p.kids[path[^1].j], true)
    return
  while path[^1].j == path[^1].p.data.len - 1:
    path.setLen(path.len - 1) #discard path.pop
    if path.len == 0:         #path @beg; No more succs
      return
    path[^1].j.dec
  path[^1].j.inc

proc seek_pred*(path: var Path) = # path -> predecessor
  if path.len == 0: return
  if path[^1].p.kids.len > 0:
    path.seek_most(path[^1].p.kids[path[^1].j], false)
    return
  while path[^1].j == 0:
    path.setLen(path.len - 1) #discard path.pop
    if path.len == 0:         #path @beg; No more preds
      return
  path[^1].j.dec

iterator forward*(root: Node):
    tuple[node: Node; j, depth, val: int] =
  var path = seek_most(root, true)
  while path.len > 0:
    yield (path[^1].p, path[^1].j, path.len,
           path[^1].p.data[path[^1].j])
    path.seek_succ

iterator reverse*(root: Node):
    tuple[node: Node; j, depth, val: int] =
  var path = seek_most(root, false)
  while path.len > 0:
    yield (path[^1].p, path[^1].j, path.len,
           path[^1].p.data[path[^1].j])
    path.seek_pred

import strutils
proc `$`(n: Node): string = cast[uint](n).toHex
let spaces = repeat(' ', 80)

proc print*(n: Node, depth=0) =
  if n == nil: return
  for i in 0 ..< n.data.len:
    if i < n.kids.len:
      n.kids[i].print depth + 1
    echo spaces[0..<2*depth], $n, "(",i,"): ", n.data[i]
  if n.kids.len > 0:
    n.kids[^1].print depth + 1

var root = newNode([100, 200, 300]) # These values
root.kids.add newNode([20, 40, 60]) #..make this a
root.kids.add newNode([120, 140, 160])  #..full B-tree
root.kids.add newNode([220, 240, 260])
root.kids.add newNode([320, 340, 360])
root.print
echo ""
for (p, j, depth, val) in root.reverse:
  echo "R: ",spaces[0..<2*depth], $p, "(",j,"): ", val
echo ""
for (p, j, depth, val) in root.forward:
  echo "F: ",spaces[0..<2*depth], $p, "(",j,"): ", val


Run


Re: A path commonPrefix function

2020-01-06 Thread cblake
(One could probably inline the range work into the lcp and keep track of just 
the first character and exit early if it's ever different to combine the best 
properties...)


Re: A path commonPrefix function

2020-01-06 Thread cblake
For explicitness, 


proc lcpLenVRange(strs: openArray[string]): int =
  if strs.len == 0: return -1 # raise?
  var minAt, maxAt: int
  for i in 1 ..< strs.len:
    if strs[i] < strs[minAt]: minAt = i
    if strs[i] > strs[maxAt]: maxAt = i
    if strs[i].len > 0 and strs[i][0] != strs[0][0]:
      return 0
  for i, c in strs[minAt]:
    if c != strs[maxAt][i]: return i
  return strs[minAt].len


Run

Integrating that into the test harness appropriately and running on my two 
running data input examples (extended to a case with zero length common 
prefix), and switching to `mnmx` (which gives the range of times instead of 
mean+-sdev) then gives: 


$ for a in lcpVe lcpR lcpB lcpVR; { (repeat 20; lcp -a$a 
/usr/lib/python3.6/**)|mnmx }
19 in 2300.262..2448.32 microseconds via lcpVertical
19 in 1122.236..1162.767 microseconds via lcpRange
19 in 1523.256..1558.781 microseconds via lcpBinSearch
19 in 1189.47..1224.756 microseconds via lcpVRange

$ for a in lcpVe lcpR lcpB lcpVR; { (repeat 20; lcp -a$a /usr/bin/*)|mnmx }
9 in 51.737..76.532 microseconds via lcpVertical
9 in 48.637..68.903 microseconds via lcpRange
9 in 36.24..40.054 microseconds via lcpBinSearch
9 in 46.253..54.598 microseconds via lcpVRange

$ for a in lcpVe lcpR lcpB lcpVR; { (cd /usr/bin; repeat 20; lcp -a$a 
*)|mnmx }
0 in 3.099..5.722 microseconds via lcpVertical
0 in 43.392..80.109 microseconds via lcpRange
0 in 8.583..13.113 microseconds via lcpBinSearch
0 in 3.099..5.245 microseconds via lcpVRange

$ for a in lcpVe lcpR lcpB lcpVR; { (cd /usr/lib/python3.6; repeat 20; lcp 
-a$a **)|mnmx }
0 in 7.868..11.683 microseconds via lcpVertical
0 in 911.713..959.158 microseconds via lcpRange
0 in 174.522..190.735 microseconds via lcpBinSearch
0 in 9.537..14.782 microseconds via lcpVRange


Run

So, you can get the early exit of the vertical method for zero common prefices 
in the range method without very much extra cost. E.g., slightly 
negative/in-the-noise cost for the L2 prefixLen==9 case and 6% slower for the 
prefixLen=19 L3 case with the early exits basically matching times in the two 
prefixLen==0 cases.

(and, yeah, yeah..the Nim program should `import stats`, do the repeated loops 
internally updating a `RunningStat`, and report. It could even take an `lcpAll` 
enum to report on all algorithms..and -- if one wants to commit to only file 
tree inputs -- maybe even take just a directory as a command parameter, `import 
oswalkdir`, and use `walkDir` and `walkDirRec` instead of `*` and `**` to 
remove dependency on shell features.)


Re: A path commonPrefix function

2020-01-07 Thread cblake
Things in that last version weren't robust/efficient with regards to zero 
length strings being in the mix. This fixes that: 


proc lcpLenVRange(strs: openArray[string]): int =
  if strs.len == 0: return -1 # raise?
  if strs[0].len == 0: return 0   # ANY "" anywhere => 0
  let byte0 = strs[0][0]
  var minAt, maxAt: int
  for i in 1 ..< strs.len:
    if strs[i].len == 0: return 0
    if strs[i] < strs[minAt]: minAt = i
    if strs[i] > strs[maxAt]: maxAt = i
    if strs[i][0] != byte0: return 0
  for i, c in strs[minAt]:
    if c != strs[maxAt][i]: return i
  return strs[minAt].len


Run

In the interests of being mostly self-contained, I also pushed a better 
`timeIt` to `cligen#head` and updated my driver to use the new `timeIt` and to 
**not** time the `stdout.write` parts: 


proc timeAlgo(algo=lcpVRange, reps=1, strs: seq[string]) =
  var x: int
  timeIt(stdout.write, "", 1e-6, 3, " usec via " & ($algo)[3..^1], reps):
    x += strs.lcpLen(algo)
  echo " ", int(x.float / reps.float + 0.5)


Run

Finally, I created a `benchLcp.zsh` script: 


#!/bin/zsh
lcp=`pwd`/lcp   #We need Zsh only for its **
echo "L3 data, 19 answer"
for a in lcpVe lcpR lcpB lcpVR; do $lcp -r10 -a$a /usr/lib/python3.6/**; done
echo "L2 data, 9 answer"
for a in lcpVe lcpR lcpB lcpVR; do $lcp -r10 -a$a /usr/bin/*; done
echo "L3 data, 0 answer"
for a in lcpVe lcpR lcpB lcpVR; do (cd /usr/lib/python3.6; $lcp -r10 -a$a **); done
echo "L2 data, 0 answer"
for a in lcpVe lcpR lcpB lcpVR; do (cd /usr/bin; $lcp -r10 -a$a *); done


Run

and used a little `nim-pgo` wrapper I have to use gcc's profile guided 
optimization (`-fprofile-generate` and `-fprofile-use`), compiling, running a 
test/bench, then re-compiling. With all that, I get: 


L3 data, 19 answer
 1910.448..2253.771 1959.491 +- 32.812 usec via Vertical 19
 686.884..1150.131 777.125 +- 48.653 usec via Range 19
 922.680..1522.064 996.113 +- 58.569 usec via BinSearch 19
 693.560..1163.244 786.591 +- 49.932 usec via VRange 19
L2 data, 9 answer
 41.246..63.181 45.848 +- 2.480 usec via Vertical 9
 37.193..54.359 40.936 +- 1.778 usec via Range 9
 22.650..25.511 23.413 +- 0.353 usec via BinSearch 9
 36.716..55.075 41.509 +- 2.292 usec via VRange 9
L3 data, 0 answer
 0.000..0.954 0.167 +- 0.094 usec via Vertical 0
 535.488..1017.570 627.780 +- 48.740 usec via Range 0
 58.651..205.994 82.874 +- 15.300 usec via BinSearch 0
 0.715..3.576 1.097 +- 0.278 usec via VRange 0
L2 data, 0 answer
 0.000..0.000 0.000 +- 0.000 usec via Vertical 0
 32.663..38.862 33.927 +- 0.650 usec via Range 0
 3.099..4.768 3.290 +- 0.166 usec via BinSearch 0
 0.000..0.238 0.048 +- 0.032 usec via VRange 0


Run

which makes total sense to me. The @marks performance discrepancies (except his 
mysterious `foldl` bit) probably just come from the "mix"/averaging of how 
many zero length answers are in his inputs.

So, this new variant of that VRange hybrid method would make the most sense to 
recommend for general usage (of the above selection, anyway). It's never that 
slow, fastest at large scale when performance matters most, and can leverage 
early exit in easy cases.

The binary search way for smaller scale data may be best if the caller already 
knows and can pass in a data scale parameter. If we must measure the data size, 
that takes a whole pass through the length fields. That measurement eats into 
the 36/22=1.6x L2-non-zero speed ratio, probably reducing it to like 1.3x which 
isn't much of a boost for all the complexity of measurement+algorithm 
switching. I got about a 1.15x boost from PGO, and most people would probably 
skip that entirely to avoid build complexity.


Re: Introducing --gc:arc

2020-01-07 Thread cblake
On Linux 5.4.2, `gcc 9.2.0 -march=native -O3 -fno-strict-aliasing` , head of 
devel (...ea39524bac14f6c89de5213509f4099) with a similarly tweaked run-5-times 
variant (using `epochTime` instead of `getTicks`), I see no discrepancy in 
elapsed time between `--gc:arc -d:release` and `--gc:arc -d:danger`. It does seem 
maybe more variable with only `-d:release`, but I have not studied it very 
systematically. I get numbers like 0.07 milliseconds per 1 loop of the 5 both 
ways..min == max, even. It does vary a lot from run-to-run of the program 
though, with both release & danger modes. I suspect we need a more reliable way 
to benchmark before being too worried.


Re: Introducing --gc:arc

2020-01-07 Thread cblake
One other thing I can say that may be helpful is that when you do the "max time 
of N samples" you shift the probability distribution (not density) by an 
_exponent_ of N. See 
[https://en.wikipedia.org/wiki/Extreme_value_theory](https://en.wikipedia.org/wiki/Extreme_value_theory)
 for details. Basically, if the latency has distribution F(t), the worst of 5 
has distribution F(t)**5. The median is defined by dist=0.5, so the median of 
the worst of 5 sits at the 0.5**(1/5) ~= 87th percentile of the base time 
distribution (equivalently, the worst of 5 exceeds the base median 1 - 0.5**5 = 
96.875% of the time). Similarly, the median of the min time of 5 runs is only 
about the 13th percentile of the base time distribution. So best/worst of 5 you 
expect (in a median sense) to sample the lower/upper tail of the base 
distribution. A lot of the time in
benchmarking one takes the min of N to filter out all the noise from other 
things going on in the system. Unlike many things in statistics these results 
do not rely on anything about the shape of F(t)..It's just basic probability 
theory.

What this means in this benchmark example is that it is going to be very 
sensitive (by construction if not intent) to **a lot** of competing activity on 
the system - network interrupts, competing processes, OS kernel goings on, 
L2/L3 cache evictions, etc. So, it is "noisy by design". The bigger N the 
noisier the max will be. For example, if you have a Winchester hard drive on 
the system and you make N big enough you will sample some suspension of the 
process, CPU migration, total re-warm up of L2 cache, and however long the disk 
service takes. None of that really relates _directly_ to "memory manager 
performance". It's more measuring how loaded your system is.

Anyway, to me this explains the high run to run variability/difficulty of 
interpretation/difficulty to reproduce here. The ~87th percentile is pretty 
high, and that's just the median of the worst-of-5; the tail of F(t)**5 reaches 
much further out. And 
those tail-tail events are ever less relevant to comparing memory managers. To 
the extent they are of interest, because they are so very noisy, it's going to 
require a lot of sampling to assess them.

The truly worst-worst-worst cases where everything is cold cache are actually 
pretty hard to measure for a variety of reasons. A lot of the time "hot loop 
benchmarks" can be **_very_** misleading about worst case latency and even pros 
mess this up _all the time_. I could provide a couple of fun examples sometime.

Memory is mostly about DIMM/L3/L2/L1 latencies and bandwidths, though. You can 
measure those and reason from there more by looking at code and "calculating" 
from those measured constants than by timing it. Often cold cache DIMM hits are 
the most important thing if you aren't moving a lot of memory around. Then 
number of copies if you are moving. { This is why I like linear probing hash 
tables (with|without only local Robin Hood reorg which can help insert-heavy 
workloads). You just know it's basically just one 64-128 byte cache fill for 
really big tables when it's most painful, and you can't do better than 1. }


Re: Introducing --gc:arc

2020-01-07 Thread cblake
TL;DR - min of N is bounded below by the true, necessary time to do some 
calculation and can substitute for a "best case analysis". On the other side, 
max of N times on a multi-tasking system is not bounded by anything and 
polluted by everything. The upper tails mostly measure things that probably "do 
not generalize" to other systems (or even the same system at other times). max 
of N times cannot substitute for worst case analysis and mostly creates 
confusion which I suspect is the case here.


Re: Introducing --gc:arc

2020-01-08 Thread cblake
Also, I just noticed that the `worst` in Araq's benchmark is the "max of 
`msgCount` == 1 million trials", effectively. @miran did not mention if he was 
careful to re-initialize `worst` to zero in his "slightly modified" benchmark 
either. So, it might be more like 5 million. As should be clear from above, the 
max of millions is just begging to capture system noise not the effect in 
question, and I would not expect very consistent results over runs/systems/or 
even over time on the same system.

Here is a perhaps interesting example of latency measurement. The basic task is 
measuring memory latency. One might **_think_** just accessing a random element 
of a large array that fits in available RAM is a way to do this, but one would 
be **_wrong_** at least in a "hot loop" setting. Anyway, not that it's rocket 
science, but the program also exhibits a way to measure some low quantile like 
min of N (itself a statistical estimate and so in this case averaged/sdev'd): 


import random, times, strutils, stats, cligen

proc prepRanElt(x: var seq[int], n: int) =
  for i in 0 ..< n: x[i] = rand(9) #9 keeps sums short

#Optimized for speed not randomness to maximize delta
var r = initRand(123)
proc runRanElt(x: seq[int], nAcc: int): int =
  let mask = x.len - 1 #XXX (1 shl floor(lg(x.len-1
  for i in 1..nAcc: result += x[r.next.int and mask]

proc prepShuffle(x: var seq[int], n: int) =
  for i in 0 ..< n: x[i] = i#Pop seq with identity
  x.shuffle #..perm & then shuffle

proc runShuffle(x: seq[int], nAcc: int): int =
  for i in 1..nAcc: result = x[result]

proc fmt(x=0.0, n=3): auto=formatFloat(x, ffDecimal, n)

proc time(prep, run: auto; n, nAcc, avgN, minN: int) =
  var dtMins: RunningStat
  var s = 0 #Block optimizing away all work by use
  var x = newSeq[int](n)
  x.prep n
  for avgIt in 1..avgN:
    var dtMin = float.high
    for minIt in 1..minN:
      let t0 = epochTime()
      s += x.run(nAcc)
      let dt = (epochTime() - t0) * 1e9 / nAcc.float
      dtMin = min(dtMin, dt)
    dtMins.push dtMin
  echo "KiB: ",n shr 7," ns/Access: ", fmt(dtMins.mean),
       " +- ",fmt(dtMins.standardDeviationS)," s:", s

type Algo = enum ranElt, shuff

proc lat*(kind=shuff, sizeKiB=1048576, nAcc=1_000_000,
  avgN=4, minN=4, seed=0) =
  ## Time latency two ways.  One does NOT measure it!
  if seed > 0: r = initRand(seed)
  else: randomize(); r = initRand(rand(10))
  let n = (sizeKiB shl 10) div int.sizeof
  if kind == shuff:
    time(prepShuffle, runShuffle, n, nAcc, avgN, minN)
  else:
    time(prepRanElt, runRanElt, n, nAcc, avgN, minN)

dispatch(lat, help={"kind": "shuff: chase ran perm\n" &
                            "ranElt: access ran elt",
                    "seed": "0=>random, else set" })


Run

I get output like this for (1 shl 20) KiB == 1 GiB working set: 


shell$ memlat -kr; memlat -ks
KiB: 1048576 ns/Access: 8.197 +- 0.002 s:71968628
KiB: 1048576 ns/Access: 67.330 +- 0.078 s:1295362736


Run

I am pretty sure that what is going on is that the work to make the next 
pseudorandom number and the subsequent cache line load is predictable. So, 
speculative execution is letting the CPU "work ahead" of the memory system by 
many loops, and its coordinated prefetching then masks the memory system 
latency quite well, yielding a kind of "amortized/fake latency". Meanwhile the 
CPU cannot predict the bouncing randomly around the array loads because the 
next hop isn't known until the last hop is done. So, it has to actually wait on 
the DIMMs and you get to see the real round-trip latency.

This may all seem "off track/topic", but many benchmarks have the pattern of 
the former rather than the latter and then fail to represent real world 
performance creating confusion. For example, hash table benchmarks often look 
_just_ like the above random element case. One example of pros getting that 
wrong, Table 1 in Kraska 2017 
([https://arxiv.org/abs/1712.01208](https://arxiv.org/abs/1712.01208)) has 
crazy good numbers like 31 nanoseconds for hash lookup latency. I don't think 
anyone makes DIMMs with anything **_near_** that kind of latency and if those 
were being used it would require special mention (not that the evaluation 
sections of those authors ever stray too close to reproducibility or 
analyzability...). Anyway, that 31 ns is almost surely amortized/fake latency. 
I think even all the blogger follow-up to Kraska2017 failed to observe this 
particular issue. Hardware manufacturers worsen this whole situation by lying 
about/quoting amortized instead of true round trip latency (what "latency" has 
meant my whole life anyway).

Another comment about that benchmark code is that even putting 
`rdtsc`/epochTime/e

Re: Introducing --gc:arc

2020-01-08 Thread cblake
One other thing you might do, and I hesitate to make this recommendation 
because it's still a decidedly difficult to reproduce scenario, is to take the 
min of 100 trials of the max of 1 or something like that.

The trick there is making sure the "time of interest" for that max of 1 is 
short enough that the min has a > 1% chance of filtering out all extraneous 
"unwanted" system activity that might be sampled. There is other pressure 
wanting to make that time long - such as being able to probe L3/DIMM-resident 
behavior of things.

You might not be able to get even a 1% chance of a quiescent system for the 
duration of that 1 trial thing. On unix, you could shutdown the window 
system/all the services and say, be in single user mode..Maybe even unplug the 
network cable and definitely don't make any keystrokes after you press ENTER. 
Presumably what you are after is "Worst case-ness of just GC related things".

That might be a start to a measuring mode of assessing it instead of just 
"pondering". I'm not sure it's much better than pondering, though, and 
definitely deserves fair warning to would be reproducers.


Re: Introducing --gc:arc

2020-01-08 Thread cblake
Some guy wrote his own "measurement OS" for studying CPU behavior in light of 
Spectre type information leak vulnerabilities. I forget the name of it at the 
moment. I do not think he open sourced it, though.

Anyway, if you don't have something like that handy and you don't want to chase 
performance ghosts, what I would recommend is an L-average of the M-min of the 
N-max (basically like my `memlat` code) where you tune the loops sizes to get 
small standard deviations in the outermost and N is probably adjusted based on 
L1/L2/L3/DIMM levels. If you use the smallest max loop that can measure what 
you're after that will "multiply", meaning smaller min and averaging loops will 
get you stable numbers.

Even in single user mode with no network cable, **_and_** with low standard 
deviations in the outermost loop, depending on how many milliseconds the max 
loop takes, you might still be measuring "time you want" + "a not-varying-much 
scheduled-time/L3-sharing impact from the max number of Linux kernel threads 
that can line up against my program at once during my max loop". You may be able to use 
`chrt -r 99 myprogram` or `taskset` to minimize that. Chances are that if the 
inner max loop is pretty short lived, "not varying much competers" is unlikely, 
and so a small standard deviation is unlikely to be a false flag.

Past that step, there is also a way to boot a Linux kernel with some CPU cores 
"taken out of scheduler availability" (`isolcpus`) and then when you use 
`taskset` to assign some process to this isolated core that is the only thing 
that runs on that core, period. That probably reduces the "what am I 
measuring?" uncertainty to virtual memory system and L3 sharing impacts. I 
doubt many/any Linux kernel threads are very VM/L3 intensive when in 
single-user mode. So, that's probably about as close as you could get, and 
actually probably pretty close. It's admittedly all a bit of a PITA, but 
probably less so than buying a real-time OS or writing your own. :-)

Anyway, maybe everyone knew all this already, but it seemed worth saying in 
light of that code @miran linked to.

