Re: Fastest way to count number of lines

2017-10-21 Thread cblake
Nice, boia01! In my same `/usr/share/nim/**.nim` test I get 768 microseconds for your version and 2080 microseconds for just doing the `memSlices` approach. So, a 2.7X speedup, a bit less than the 4X I saw when I last compared the approach in two C versions. Maybe unrolling. Dunno. @alfrednewman - if the sta

Re: Fastest way to count number of lines

2017-10-21 Thread boia01
Just for fun, I ported and hacked together a self-contained Nim version of Daniel Lemire's avxcount: [https://gist.github.com/aboisvert/3f89bc0ae0a2168fcf35ccca98177f6a](https://gist.github.com/aboisvert/3f89bc0ae0a2168fcf35ccca98177f6a) (I didn't bother with the loop-unrolled versions)

Re: Fastest way to count number of lines

2017-10-21 Thread alfrednewman
Hi, thanks for the help of all of you. Yes, I'm pre-calculating things. In the data orchestration process I'm involved in, I can usually estimate the time of a rendering based on the number of rows I'm processing. It is a linear process and the processing time is typically not much affected as

Re: Fastest way to count number of lines

2017-10-21 Thread cblake
Yeah. Depending on what he's doing, same-file dynamic estimation might also work. Good point, @jlp765. On my system the 5.4 MB of `/usr/share/nim/**.nim` gets counted in about 4 milliseconds - over 1.3 GB/sec, probably faster than all but the most powerhouse NVMe/disk array IO. This is why I su

Re: Fastest way to count number of lines

2017-10-20 Thread jlp765
How giant is a "giant text file"? On my machine a 75 MB file takes roughly 0.12 sec to count the lines (it is dummy data, so not very random). If it is gigabytes in size, then close enough might be good enough. I didn't see @cblake mention it, but you could count the bytes to read 100 lines of a big f
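The post above is cut off in this archive, so the following is only a guess at the idea: measure how many bytes the first 100 lines occupy, then extrapolate using the total file size. A minimal sketch, assuming LF line endings and roughly uniform line lengths; `estimateLines` and its `sample` parameter are hypothetical names, not from the thread.

```nim
import os

proc estimateLines(path: string, sample: BiggestInt = 100): BiggestInt =
  ## Extrapolates a line count from the byte length of the first `sample` lines.
  let total = getFileSize(path)        # filesystem metadata, no content read
  var bytes: BiggestInt = 0
  var n: BiggestInt = 0
  for line in lines(path):
    bytes += BiggestInt(line.len + 1)  # +1 for the stripped "\n" (LF assumed)
    inc n
    if n >= sample: break
  if n == 0 or bytes == 0: return 0    # empty file
  if n < sample: return n              # whole file was read, so this is exact
  result = total * n div bytes         # lines ~ total bytes / bytes per line
```

The estimate is only as good as the sample: if early lines are much shorter or longer than the rest, the result will be off by the same ratio.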

Re: Fastest way to count number of lines

2017-10-20 Thread cblake
So, one other thing that is _probably_ obvious but bears mentioning just in case it didn't occur to @alfrednewman - the number of addressable/seekable bytes in a file is usually maintained by any modern filesystem on any modern OS as cheaply accessed metadata. So, if what you really need is not
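The byte count mentioned above can be read from filesystem metadata without touching the file contents. A minimal sketch using `getFileSize` from the standard `os` module; the wrapper name `fileBytes` is hypothetical.

```nim
import os

proc fileBytes(path: string): BiggestInt =
  ## Size in bytes as reported by the filesystem; the file body is not read.
  getFileSize(path)
```

If the downstream processing time scales with input size, this number alone may be a usable proxy for the line count.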

Re: Fastest way to count number of lines

2017-10-20 Thread Stefan_Salewski
Please also compare this thread: [https://forum.nim-lang.org/t/1164#18006](https://forum.nim-lang.org/t/1164#18006) I have not yet used SIMD instructions myself in Nim, but there are some hints in the Forum already. For line counting, the different end-of-line marks for Unix/Windows/Mac makes

Re: Fastest way to count number of lines

2017-10-20 Thread alfrednewman
Guys, thank you for your help. @Stefan_Salewski, yes, speed is an important point for me. I found the link you provided (about SIMD) very interesting ... however, I do not know how to do this using Nim. Could you please help? Even to help newbies like me, thought to include the response of this

Re: Fastest way to count number of lines

2017-10-20 Thread Stefan_Salewski
If speed is really important for you, you may consider SIMD instructions. D. Lemire gave an example for this in his nice blog: [https://lemire.me/blog/2017/02/14/how-fast-can-you-count-lines/](https://lemire.me/blog/2017/02/14/how-fast-can-you-count-lines/)

Re: Fastest way to count number of lines

2017-10-20 Thread cblake
@jlp765 - good catch. I thought of that, too (I actually wrote that `memSlices` stuff), and almost went back and added a note later, but you beat me to it. I am still unaware of relative timings on platforms other than what I personally use and would be interested to hear reports, but on Lin

Re: Fastest way to count number of lines

2017-10-20 Thread jlp765
Even faster (avoiding some string allocations):

    import memfiles
    var i = 0
    for line in memSlices(memfiles.open("foo")):
      inc(i)

Re: Fastest way to count number of lines

2017-10-20 Thread cblake
It sounds like you will have many regular files (i.e., random access/seekable inputs as opposed to things like Unix pipes). On Linux with glibc, memfiles.open is probably the fastest approach which uses memchr internally to find line boundaries. E.g. (right from memfiles documentation),
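The documentation example referred to above is cut off in this archive. As a stand-in (not the documentation's own text), here is a minimal sketch of counting lines through a memory-mapped file with `memfiles.open` and `memSlices`; the proc name `countLines` is hypothetical.

```nim
import memfiles

proc countLines(path: string): int =
  ## Counts newline-delimited slices in `path` via a read-only memory map.
  var mf = memfiles.open(path)   # opens in read mode by default
  defer: mf.close()              # unmap and close even if iteration raises
  for slice in memSlices(mf):    # each slice is one line, no string allocated
    inc result
```

An unterminated final line is still yielded as a slice, so `"a\nb\nc"` and `"a\nb\nc\n"` should both count as three lines. Mapping a zero-length file may raise an error on some platforms, so callers may want to check the size first.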

Fastest way to count number of lines

2017-10-20 Thread alfrednewman
Hello, Before processing a giant txt file, I need to know in advance how many lines that file has. Since I will have to process multiple files it would be important to perform this line counting operation as quickly as possible. What is the fastest way to know how many lines a txt file has? I