date:20171020

Re: Fastest way to count number of lines

2017-10-20 Thread jlp765

How giant is a "giant text file"?

On my machine a 75M file takes roughly 0.12 sec to count the lines (it is dummy 
data, so not very random).

If it is gigabytes in size, then close enough might be good enough.

I didn't see @cblake mention it, but you could count the bytes to read 100 
lines of a big file and use that to estimate the overall number of lines.
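A minimal sketch of that sampling idea, assuming the file is non-empty, uses `\n` terminators, and has fairly uniform line lengths. The proc name `estimateLineCount` and the `sampleLines` parameter are made up for illustration:

```nim
import memfiles

proc estimateLineCount(path: string, sampleLines = 100): int =
  ## Extrapolates a line count from the bytes used by the first `sampleLines` lines.
  var mf = memfiles.open(path)
  defer: mf.close()
  var seen = 0
  var sampleBytes = 0
  for slc in memSlices(mf):
    inc seen
    sampleBytes += slc.size + 1   # +1 for the '\n' that memSlices strips
    if seen >= sampleLines: break
  if seen < sampleLines or sampleBytes == 0:
    return seen                   # whole file was scanned, so this count is exact
  # Scale the sampled lines up by total bytes / sampled bytes.
  result = int(float(mf.size) * float(seen) / float(sampleBytes))
```

The estimate is only as good as the assumption that the first lines are representative of the rest of the file.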

Re: Fastest way to count number of lines

2017-10-20 Thread cblake

So, one other thing that is _probably_ obvious but bears mentioning just in 
case it didn't occur to @alfrednewman - the number of addressable/seekable 
bytes in a file is usually maintained by any modern filesystem on any modern OS 
as cheaply accessed metadata. So, if what you really need is not an exact count 
but a "reasonable bound" that could be violated once in a while then you may be 
able to _really_ accelerate your processing.

As one example text corpus that we all have access to, the current Nim git 
repo's `lib/nim/**.nim` files have an average length of 33 bytes and a standard 
deviation of about 22 bytes as assessed by this little program: 


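import os, memfiles
when isMainModule:
  # Assumed setup: map the file named by the first command-line argument.
  var inp = memfiles.open(paramStr(1))
  var counter = 0
  var sumSq = 0
  for slc in memSlices(inp):
    inc(counter)
    sumSq += slc.size * slc.size
  echo "num lines: ", counter
  # inp.size counts terminator bytes too; slc.size does not.
  let mean = float(inp.size) / float(counter)
  echo "mean length: ", mean
  let meanSq = float(sumSq) / float(counter)
  echo "var(length): ", meanSq - mean*mean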


You could probably reasonably bound the average number of bytes per line as, 
say (average + 4*stdev) which in this case is about 33+22*4 =~ 121 bytes..maybe 
round up to 128. Then you could do something like: 


import os
# myPath is the path of the input file, defined elsewhere.
var reasonableUpperBoundOnLineCount = int(float(getFileInfo(myPath).size) / float(128))


If you use that bound to allocate something then you are unlikely to over 
allocate memory by more than about 4X which isn't usually considered "that bad" 
in this kind of problem space. Depending on what you are doing you can tune 
that parameter and you might need to be prepared in your code to "spill" past a 
very, very rare 4 standard deviations tail event. This optimization will beat 
the pants off any even AVX512 deal that iterates over all the file bytes at 
least for this aspect of the calculation. It basically eliminates a whole pass 
over the input data in a case that is made common by construction.

Since you have seemed pretty dead set on an exact calculation in other posts, a 
small elaboration upon the "embedded assumptions" in this optimization may be 
warranted. All that is really relied upon is that some initial sample of files 
can predict the distribution of line lengths "well enough" to estimate some 
threshold (that "128" divisor) that has "tunably rare" spillovers, where they 
are rare enough to not cause much slowdown in whatever ultimate calculation you 
are actually doing which you have not been clear about.
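As a rough sketch of the "size from metadata, then tolerate spills" pattern, assuming the ultimate calculation collects one value per line: the proc name `lineLengths` and the `bytesPerLineGuess` parameter are hypothetical. The seq is pre-sized from the file size alone and simply grows if the guess turns out too low.

```nim
import os, memfiles

proc lineLengths(path: string, bytesPerLineGuess = 128): seq[int] =
  ## Collects the length of every line, pre-sizing the seq from file metadata.
  # Capacity guess from the file size only; no pass over the data is needed.
  let capacityGuess = int(getFileSize(path)) div bytesPerLineGuess + 1
  result = newSeqOfCap[int](capacityGuess)
  var mf = memfiles.open(path)
  defer: mf.close()
  for slc in memSlices(mf):
    # `add` reallocates only when the guess was too low (the "spill" case).
    result.add(slc.size)
```

Tuning `bytesPerLineGuess` trades wasted memory against how often the seq has to reallocate.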

Another idea along these lines, if, say, the files are processed over and over 
again, is to avoid re-computing all those `memchr()`s by writing a little 
proc/program to maintain an off-to-the-side file of byte indexes to the 
beginnings of lines. The idea here would be that you have two sets of files, 
your actual input files and some paired file "foo.idx" with foo.idx containing 
just a bunch of binary ints in the native format of the CPU that are either 
byte offsets or line lengths effectively caching the answer of the `memchr`.

If you had such index files then when you want to know how many lines a file is 
you can `getFileInfo` on the ".idx" file and know immediately. You could be 
careful and check modification times on the .idx and the original data file and 
that sort of thing, too. Why, you could even add a "smarter" `memSlices` that 
checked for such a file and skipped almost all its work and an API call 
`numSlices` that skipped all the work if the ".idx" file is up-to-date 
according to time stamps.
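A minimal sketch of such an index file, under several assumptions: the index stores one native-format `int64` line length per line, the index is named by appending `.idx` to the data file's path, and freshness is judged by modification time only. `buildIndex` and `numLines` here are illustrative helpers, not existing `memfiles` APIs:

```nim
import os, memfiles, times

proc idxPath(path: string): string =
  ## Hypothetical naming scheme: "foo.txt" -> "foo.txt.idx".
  path & ".idx"

proc buildIndex(path: string) =
  ## Writes one native-format int64 line length per line of `path`.
  var mf = memfiles.open(path)
  defer: mf.close()
  var f = open(idxPath(path), fmWrite)
  defer: f.close()
  for slc in memSlices(mf):
    var n = int64(slc.size)
    discard f.writeBuffer(addr n, sizeof(n))

proc numLines(path: string): int =
  ## Line count read off the index's size; the index is rebuilt if missing or stale.
  let idx = idxPath(path)
  if not fileExists(idx) or
     getLastModificationTime(idx) < getLastModificationTime(path):
    buildIndex(path)
  result = int(getFileSize(idx)) div sizeof(int64)
```

Once the index exists, `numLines` costs a couple of metadata lookups instead of a pass over the data. A change to the data file made within the same timestamp tick as the index write would go unnoticed by this check.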

Basically, except for actually "just computing file statistics", it seems 
highly unlikely to me that you should really be optimizing the heck out of 
newline counting in and of itself beyond what `memSlices/memchr()` already do.

Which FUSE library shall I use?

2017-10-20 Thread joshbaptiste

Hello all. After some searching I noticed two libraries used for FUSE:

[https://github.com/zielmicha/reactorfuse](https://github.com/zielmicha/reactorfuse)

[https://github.com/akiradeveloper/nim-fuse](https://github.com/akiradeveloper/nim-fuse)

I'm starting a FUSE project and was wondering which is best to use with the current 
Nim 0.17.x?

Re: What's happening with destructors?

2017-10-20 Thread Araq

> Is the doc on regions now obsolete? It would seem that destructors now 
> obviate much of the need for regions.

It's a bit early to say but I think so, yes.

> Is all of that coming, or is it "just an idea" you want feedback on?

Destructors, assignment operators, the move optimization are coming behind a 
`--newruntime` switch and expected to be useful within days/weeks for you to 
tinker with, how to introduce even more moves ("sink parameters?") is unclear. 
The other outlined features have no ETA.

The really hard part is replacing the existing runtime with one with a 
different performance profile ("yay, deterministic freeing, yay more efficient 
multi threading possibilities, ugh, overall slower?!") and is likely stuff for 
Nim v2.

But yeah, feedback is always appreciated.

Re: What's happening with destructors?

2017-10-20 Thread monster

@Araq I read the blog post. I have two questions:

1) Is all of that coming, or is it "just an idea" you want feedback on? I 
learned the new C++ 11 features recently, and most (all?) of those I thought 
were missing in Nim seem to be described in that post.

2) Is there already some rough ETA for "general availability"?

Re: what does macros.quote() do exactly?

2017-10-20 Thread bkerin

Seems to give almost the same output:

> type
>   Dumb112023 = ref object of RootObj
>     contents: int
> method frobnicate(this112025: Dumb112023) {.base.} =
>   echo "frobnicating!"

and I still don't understand why Dumb turns into Dumb112023, maybe I will later

Re: Problem using

2017-10-20 Thread jzakiya

The error messages keep saying the issue is a mismatch with FlowVar[T]. In 
Chapter 6 of **Nim in Action** here is what it says they are.

`FlowVar[T] can be thought of as a container similar to the Future[T] type, 
which you used in chapter 3. At first, the container has nothing inside it. 
When the spawned procedure is executed in a separate thread, it returns a value 
sometime in the future. When that happens, the returned value is put into the 
FlowVar container.`

Here is updated `segcount`


proc segcount(row, Kn: int): uint =
  var cnt = 0'u
  for k in 0..

Re: What should d0m96 work on in his next Nim livestream?

2017-10-20 Thread Libman

I think the most valuable effort is that which is effective at attracting more 
people to Nim, which is a force multiplier for less inspirational tasks.

Everyone knows that things like IRC and other library improvements can be done 
given someone's time and effort. But many people are judging what Nim is 
capable of based on its server side web framework, and Nim has been MIA in the 
most popular web framework benchmarks.

Re: What's happening with destructors?

2017-10-20 Thread bpr

Thanks @Araq! I'll watch the livestream later.

Is the doc on regions now obsolete? It would seem that destructors now obviate 
much of the need for regions.

Re: Windows installation

2017-10-20 Thread alfrednewman

https://github.com/dom96/choosenim

Re: Windows installation

2017-10-20 Thread Araq

Download what it says and run `finish.exe`. Can't get much simpler than that. 
And the mingw project keeps changing its installer. It varies from good to 
awful so I prefer to not depend on it.

Re: Problem using

2017-10-20 Thread Stefan_Salewski

Why do you refuse to try


cnt[i]


as jlp765 suggests?

Do you have an idea how plain


cnt +=


should work? All the parallel calculated results should accumulate in this 
single variable. Then you may need something to control the access to it.
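One way to sidestep the shared-variable problem, sketched here with a stand-in `segcount` (the real body is not shown in the thread) and a hypothetical wrapper `countAll`, is to give each task its own `FlowVar` and sum the results on the main thread afterwards. It assumes compilation with `--threads:on`:

```nim
import threadpool

proc segcount(row, kn: int): uint =
  ## Stand-in for the real per-segment count; it simply returns `kn`.
  result = uint(kn)

proc countAll(rescnt, kb, kn: int): uint =
  ## Spawns one task per segment and sums their results.
  # One FlowVar per task, so no two threads ever write to the same variable.
  var parts = newSeq[FlowVar[uint]](rescnt)
  for i in 0 ..< rescnt:
    parts[i] = spawn segcount(i * kb, kn)
  for p in parts:
    result += ^p   # `^` blocks until that task's result is available
```

Because the accumulation happens only in the calling thread, no lock is needed around the sum.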

Re: Problem using

2017-10-20 Thread jzakiya

In the previous snippet I forgot the `spawn`. The code below compiles, but is 
slower.


  var cnt = 0# count for the primes, the '1' bytes
  for i in 0..

Re: Problem using

2017-10-20 Thread jzakiya

When I use `parallel` I get this compiler error:


  parallel:
    var cnt = 0'u  # count for the nonprimes, the '1' bytes
    for i in 0..rescnt-1:  # count Kn resgroups along each restrack
      cnt += spawn segcount(i*KB, Kn)  <-- points to start of '('
  sync()
  primecnt += cnt
-
ssozp5x1c1par.nim(155, 28) Error: 'spawn' must not be discarded

Windows installation

2017-10-20 Thread freeflow

Hi

Just tried installing Nim following the directions from here

[https://nim-lang.org/install_windows.html](https://nim-lang.org/install_windows.html)

You _CAN_ do better than this.

For example why are you not linking to

[https://sourceforge.net/projects/mingw-w64/](https://sourceforge.net/projects/mingw-w64/)

rather than

[https://nim-lang.org/download/mingw64-6.3.0.7z](https://nim-lang.org/download/mingw64-6.3.0.7z)

Re: nim-cookbook

2017-10-20 Thread wizzardx

Update: Looks like I can't delete the wiki directly!

(I think this is for good reason; on a case by case basis - so that eg, rogue 
admins can't take their community hostage or whatever).

Have made an official request to wikia to delete it, opening ticket #349088 on 
their system.

It should hopefully be deleted within 2 business days.

Wikia wiki is currently over here, but should hopefully disappear soon:

[http://nim-lang.wikia.com/wiki/The_Nim_programming_language_Wiki](http://nim-lang.wikia.com/wiki/The_Nim_programming_language_Wiki)

Re: What should next Araq's live stream be about?

2017-10-20 Thread Tiberium

The recording is now available! 
[https://www.youtube.com/watch?v=KNUDGZuqfQM](https://www.youtube.com/watch?v=KNUDGZuqfQM)

Re: nim-cookbook

2017-10-20 Thread wizzardx

Hmm, then I'll put that on hold for now 

Already created the wikia, but I'll see if I can nuke it.

Re: What should next Araq's live stream be about?

2017-10-20 Thread dom96

@wizzardx: Glad you like it. You can already see all of my streams, they're all 
in [this 
playlist](https://www.youtube.com/watch?v=UQ4RvUlXIDI&index=3&list=PLm-fq5xBdPkrMuVkPWuho7XzszB6kJ2My).
 And Araq's are in [his 
channel](https://www.youtube.com/channel/UCAIXKsgiEkRjwlNgduABgmw).

We don't really have a formal way to give suggestions on what we should do in 
our livestreams. Best way is to just tell us on IRC I guess 

For me, Twitter will also work if that's easier. For IRC you can also use 
Gitter.

Re: nim-cookbook

2017-10-20 Thread dom96

IMO wikia is pretty bad because it includes ads and is overall very bloated.

We used to have mediawiki (8+ years ago :)) but it attracted too many spammers.

* * *

As for the cookbook, awesome job! Please add a link to it on the [Nim 
website](https://github.com/nim-lang/website) (by creating a PR).

Re: What should d0m96 work on in his next Nim livestream?

2017-10-20 Thread dom96

Thanks for posting this @Tiberium 

Thanks to all who voted as well.

Re: Fastest way to count number of lines

2017-10-20 Thread Stefan_Salewski

Please also compare this thread:

[https://forum.nim-lang.org/t/1164#18006](https://forum.nim-lang.org/t/1164#18006)

I have not yet used SIMD instructions myself in Nim, but there are some hints 
in the Forum already.

For line counting, the different end-of-line marks for Unix/Windows/Mac make 
it a bit more complicated, unfortunately.
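To illustrate the complication, here is a small sketch (not a tuned or SIMD implementation) that counts lines in an in-memory string while treating `\n`, `\r\n`, and a lone `\r` each as a single terminator. The proc name `countLinesAnyEol` is made up for this example:

```nim
proc countLinesAnyEol(data: string): int =
  ## Counts lines, accepting "\n", "\r\n", or a lone "\r" as one terminator.
  var i = 0
  while i < data.len:
    if data[i] == '\r':
      inc result
      if i + 1 < data.len and data[i + 1] == '\n':
        inc i          # swallow the '\n' of a "\r\n" pair
    elif data[i] == '\n':
      inc result
    inc i
  # A final line without a terminator still counts as a line.
  if data.len > 0 and data[^1] notin {'\r', '\n'}:
    inc result
```

The extra branch per byte is what makes this slower than a plain `memchr` scan for `\n`.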

Re: nim-cookbook

2017-10-20 Thread alfrednewman

@wizzardx, please, go ahead !

I think your initiative (following the idea of the cookbook concept of 
@nimboolean) is completely valid

Anyway, @nimboolean already posted some interesting content at 
[https://github.com/btbytes/nim-cookbook/](https://github.com/btbytes/nim-cookbook/).
 So I think we would have to transcribe what's there for the new Wiki

Cheers !

Re: Fastest way to count number of lines

2017-10-20 Thread alfrednewman

Guys, thank you for your help.

@Stefan_Salewski, yes speed is an important point for me. I found the link you 
provided (about SIMD) very interesting ... however, I do not know how to do 
this using Nim. Could you please help?

Also, to help newbies like me, I thought of including the answers from this thread in 
the cookbook wiki being created, as per 
[https://forum.nim-lang.org/t/3259](https://forum.nim-lang.org/t/3259)

Re: nim-cookbook

2017-10-20 Thread wizzardx

Ideally: One of the core devs puts a mediawiki instance under the nim-lang.org 
server.

But failing that, Wikia is a very nice place for starting and maintaining 
community wikis, eg:

[http://bleach.wikia.com/wiki/Bleach_Wiki](http://bleach.wikia.com/wiki/Bleach_Wiki)

Edit: Should I take initiative and start it myself?

I don't want to steal anyone's thunder, or do something that's too 
controversial amongst the Nim community

Re: Problem using

2017-10-20 Thread jlp765

Try


for i in 0..rescnt-1:
  cnt[i] = spawn segcount(i*KB, Kn)


I think the issue is with `..<`

Re: nim-cookbook

2017-10-20 Thread alfrednewman

@wizzardx This is nice. Your idea really makes more sense. Count on me to help. 
How to configure a wiki ? What would be the best link (URL)?

Re: Beginner question about nil access

2017-10-20 Thread Arrrrrrrrr

The problem here is that result is not initialized to "" by default. As stated, 
this will change in the future.

Re: Beginner question about nil access

2017-10-20 Thread wizzardx

Some things I _really_ like here about nim vs many other proglangs.

1\. Nil access actually segfaults, rather than silent or undefined behavior.

2\. You get a really awesome stack trace.

I use "not nil" wherever I can, too, but it's kind of a losing battle, since 
all the other code wants nillable types, so you have to add a lot of 
converters/checking code/etc, which kinda defeats the purpose

Re: What's happening with destructors?

2017-10-20 Thread Araq

Ok, blog post is here: 
[https://nim-lang.org/araq/destructors.html](https://nim-lang.org/araq/destructors.html)

I will copy the "spec worthy" parts that have been implemented already into a 
wiki page.

Re: What should next Araq's live stream be about?

2017-10-20 Thread wizzardx

Really really cool idea; I'm planning to watch all the livestreams; at least 
once they hit youtube . I've subscribed to both Araq's and Dom's channels on 
there!

Is there a suitable place for adding suggestions for future live streaming 
subjects?

Is that best brought up in IRC perhaps? (have never been on there; I'm more of 
a forum dweller).

Re: nim-cookbook

2017-10-20 Thread wizzardx

Any chance we could do this on a regular wiki, rather than needing to make 
github pull requests?

eg like this:

[https://www.renpy.org/wiki/renpy/doc/cookbook/Cookbook](https://www.renpy.org/wiki/renpy/doc/cookbook/Cookbook)
 .

Re: Problem using

2017-10-20 Thread Stefan_Salewski

Now in your code there is no spawn at all!

For parallel processing, you have to ensure that there are no conflicts when 
parallel tasks are accessing your data, otherwise the compiler may make copies 
of the data before, which may make it slow. And for parallel processing a good 
use of the CPU cache is also important -- many parallel processes will give no 
speed increase when data is always fetched from slow RAM instead of cache.

Re: Problem using

2017-10-20 Thread jzakiya

After reading the **Nim in Action** book I got it to compile by placing a `^` 
before `spawn`, but it makes the program slower. The problem has to do with 
`segcount` returning a `FlowVar[T]` mismatch. And when I use `parallel:` it 
won't compile, and shows even more errors. Doing more research.


  var cnt = 0# count for the primes, the '1' bytes
  for i in 0..

Re: Fastest way to count number of lines

2017-10-20 Thread Stefan_Salewski

If speed is really important for you, you may consider SIMD instructions.

D. Lemire gave an example for this in his nice blog:

[https://lemire.me/blog/2017/02/14/how-fast-can-you-count-lines/](https://lemire.me/blog/2017/02/14/how-fast-can-you-count-lines/)

Re: Fastest way to count number of lines

2017-10-20 Thread cblake

@jlp765 - good catch. I thought of that, too (I actually wrote that `memSlices` 
stuff), and almost went back and added a note later, but you beat me to it. 

I still am unaware about relative timings on platforms other than what I 
personally use and would be interested to hear reports, but on Linux/glibc 
`memSlices` (or more generally `mmap+memchr` however that is invoked) is always 
fastest in my tests.

Re: Fastest way to count number of lines

2017-10-20 Thread jlp765

Even faster (avoiding some string allocations)


import memfiles
var i = 0  # line counter
for line in memSlices(memfiles.open("foo")):
  inc(i)

Re: Fastest way to count number of lines

2017-10-20 Thread cblake

It sounds like you will have many regular files (i.e., random access/seekable 
inputs as opposed to things like Unix pipes). On Linux with glibc, 
memfiles.open is probably the fastest approach which uses memchr internally to 
find line boundaries. E.g. (right from memfiles documentation), 


import memfiles
var i = 0  # line counter
for line in lines(memfiles.open("foo")):
  inc(i)


Your mileage on this may vary from OS to OS or libc to libc. I have no idea 
which if any Microsoft/Windows versions have well-optimized libc memchr() 
implementations.

Fastest way to count number of lines

2017-10-20 Thread alfrednewman

Hello,

Before processing a giant txt file, I need to know in advance how many lines 
that file has. Since I will have to process multiple files it would be 
important to perform this line counting operation as quickly as possible.

What is the fastest way to know how many lines a txt file has?

I am currently using the following: 


var i = 0  # line counter
for line in lines "input.txt":
  inc(i)


For some reason I think the code is very simple and should have some way to do 
it in a faster way...

Re: nim-cookbook

2017-10-20 Thread alfrednewman

@nimboolean, I will help you out with PRs very soon... thanks

Re: Beginner question about nil access

2017-10-20 Thread jlp765

You can't add to a nil string. Initialize it first.


proc formatTodos(list: TodoList): string =
  result = ""
  for todo in list.items():
    result.add("Todo: " & todo.desc)
    result.add("\n")


(insert Vizzini saying: "you fell for one of the classic blunders") 

There was talk of making "seq" and other "array like" variables to auto 
initialise to stop this happening.

Not sure if that will make Nim 1.0 or if it will happen at all (or maybe the 
default "not nil" will be the solution)

