fetching data from many small .txt files

2023-05-03 Thread tcheran
@cblake, sorry for the late follow-up. Each object is described by a .TXT file, containing only grid pixels (one line is a grid pixel) having a numeric value (float). No, the grid is not the same across the .TXT files, and a pixel may contain more than one object value. The subset I used to run

2023-05-03 Thread Zoom
It's a reasonable question, but when the issue is this specific it's safe to assume it's what one's been given and can hardly be changed. There's a bunch of inefficiencies in the format but when you already have the data it's faster to process it as it is, than converting it to a proper form bef

2023-05-03 Thread treeform
Have you thought of maybe doing an initial step that puts all of your small files into a database or some sort of in-memory store? Using a bunch of small files in normal runtime operation is not a great idea. You are at the mercy of an inefficient file system no matter which way you slice it.
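A minimal sketch of such an initial consolidation step, written in Python for brevity rather than the Nim used in the thread. It reads every .txt file in a directory once and builds an in-memory table keyed by grid coordinates, with a list of strings per key, as described in the original question. The `x y value` whitespace-separated line format, the `build_table` name, and the `filename:value` string layout are assumptions made for illustration; the real files may differ.

```python
import os
from collections import defaultdict


def build_table(directory):
    """Read all .txt files in `directory` once into a dict keyed by (x, y).

    Assumption: each line looks like 'x y value' (whitespace-separated).
    """
    table = defaultdict(list)
    with os.scandir(directory) as entries:
        # Sort names so the result does not depend on directory order.
        names = sorted(
            e.name for e in entries
            if e.is_file() and e.name.lower().endswith(".txt")
        )
    for name in names:
        with open(os.path.join(directory, name), "r", encoding="utf-8") as f:
            for line in f:
                parts = line.split()
                if len(parts) != 3:
                    continue  # skip blank or malformed lines
                key = (int(parts[0]), int(parts[1]))
                # Keep the value as a string, tagged with its source file.
                table[key].append(f"{name}:{parts[2]}")
    return dict(table)
```

After this one pass, lookups such as `table[(1, 2)]` touch only memory, so the file system cost is paid once instead of on every query.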

2023-05-02 Thread giaco
fyi GeoPackage is the new(ish) Spatialite (different author, but standardized by OGC)

2023-05-02 Thread Zoom
Some additional observations:
* Running on tmpfs on Linux is ~30% faster than with ImDisk on W10 for me.
* WinFSP is still unreliable for any serious work. The accumulated csv was just truncated on write with no errors.
* Using channels for threading gives a significant memory overhead. So

2023-05-02 Thread cblake
First, always best to have reproducible test data! Great initiative, @Zoom! @tcheran did not specify if the grid was fixed over samples or varying. Either could make sense (e.g. with wandering sensors that use the GPS satellite network to self-locate), but very different perf numbers & optimizat

2023-05-01 Thread Zoom
Played with it for a bit. What can I say, [ImDisk](https://sourceforge.net/projects/imdisk-toolkit) is really slow. Maybe [WinFsp's](https://winfsp.dev/) ram drive is faster, had no time to check yet. In regular circumstances I'm pretty content with the former. Adding threads brought only negl

2023-05-01 Thread tcheran
Amazing! Yeah, the new release solved it. I removed cligen 1.6.1 and installed 1.6.2 on my work laptop (the one with Nim 1.4.4), and compilation was just fine. I also ran a few tests on a subset of the data and (I needed to throw out the first run, as always) the cligen-powered version is around 20% faste

2023-05-01 Thread cblake
I believe you may be running into this bug: Let me punch a new cligen-1.6.2 release for you. Give me a few minutes.

2023-05-01 Thread tcheran
@cblake Well, I tried a bit more with your cligen powered memory map code,... **Work Laptop with Windows 10 (Nim 1.4.4)** : initially compiler complained about a missing c header... but I was running an old version of cligen so I uninstalled it and installed again 1.6.1 with nimble. The missing

2023-05-01 Thread cblake
Huh. Well, if you got only a 5% speed-up then that is consistent with your prior IO bound claims, but also consistent with said IO being very slow.. maybe from anti-malware as you propose. The Defender stuff could be intercepting system calls, too. That might be another reason to try the `std/m

2023-05-01 Thread tcheran
Hi, thank you all for your suggestions. @cblake, your comment `{ EDIT1: but I am a bit skeptical that you timed things right as with the code you showed you are unlikely to parse & print as quickly as even a SATA SSD never mind an NVMe SSD.` was most likely right. The Windows PC I'm using is a D

2023-04-30 Thread ingo
Depending on what you do with the data, store them in SQLite (or Spatialite if the coordinates are geographical).
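A minimal sketch of the SQLite idea, in Python for brevity rather than the Nim used in the thread. It stores one row per grid pixel and indexes on the coordinates, so that all values recorded at a given pixel can be fetched with one query. The `pixels` table layout, the `x y value` line format, and the helper names `load_pixels` and `values_at` are assumptions for illustration, not anything specified in the thread.

```python
import sqlite3


def load_pixels(conn, object_name, lines):
    """Insert one object's pixels; each line is assumed to be 'x y value'."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        rows.append((int(parts[0]), int(parts[1]), object_name, float(parts[2])))
    conn.executemany(
        "INSERT INTO pixels (x, y, object, value) VALUES (?, ?, ?, ?)", rows
    )
    return len(rows)


def values_at(conn, x, y):
    """Return (object, value) pairs stored for one grid pixel."""
    cur = conn.execute(
        "SELECT object, value FROM pixels WHERE x = ? AND y = ? ORDER BY object",
        (x, y),
    )
    return cur.fetchall()


# An in-memory database keeps the demo self-contained; a file path works too.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pixels (x INTEGER, y INTEGER, object TEXT, value REAL)")
conn.execute("CREATE INDEX pixels_xy ON pixels (x, y)")

load_pixels(conn, "a", ["1 2 0.5", "3 4 1.5"])
load_pixels(conn, "b", ["1 2 2.0"])
print(values_at(conn, 1, 2))  # [('a', 0.5), ('b', 2.0)]
```

The index on `(x, y)` is what makes per-pixel lookups cheap; without it every query would scan the whole table.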

2023-04-30 Thread alexeypetrushin
It may be worth trying a) async I/O and b) caching, so you don't have to process those files every time.
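A minimal sketch of the caching half of this suggestion, in Python for brevity rather than the Nim used in the thread. Parsed results are remembered together with the file's size and modification time, so a file is reparsed only when it has changed, and the cache can be saved between runs. The `x y value` line format and all function names here are assumptions for illustration.

```python
import os
import pickle


def parse_file(path):
    """Parse one file; each line is assumed to be 'x y value'."""
    pixels = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3:
                pixels.append((int(parts[0]), int(parts[1]), float(parts[2])))
    return pixels


def load_cached(path, cache):
    """Return parsed pixels, reparsing only when size or mtime changed."""
    st = os.stat(path)
    stamp = (st.st_size, st.st_mtime_ns)
    entry = cache.get(path)
    if entry is not None and entry[0] == stamp:
        return entry[1]  # cache hit: no file read, no parsing
    pixels = parse_file(path)
    cache[path] = (stamp, pixels)
    return pixels


def save_cache(cache, cache_path):
    """Persist the cache so later runs can skip unchanged files."""
    with open(cache_path, "wb") as f:
        pickle.dump(cache, f)


def open_cache(cache_path):
    """Load a previously saved cache, or start with an empty one."""
    if not os.path.exists(cache_path):
        return {}
    with open(cache_path, "rb") as f:
        return pickle.load(f)
```

A cache hit still costs one `os.stat` call per file, so this helps most when parsing, not metadata access, dominates the run time.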

2023-04-30 Thread cmc
Seems like a good use-case for my LimDB. It's a table-like interface to a mature key-value database based on memory-mapped files. For smallish strings, this is usually a lot faster than file access because once the data is loaded the first time round, your code won't be doing calls to the kerne

2023-04-30 Thread cblake
I suspect there may be system settings to optimize small file IO on Windows 10, but I am not the person to ask and that is actually not very Nim-specific. I will observe that 3,000 lines of 40-ish byte lines is like 120 KiB or 30 virtual memory pages which may not be what everyone considers "sma

2023-04-30 Thread tcheran
Hi, I was wondering if there is a way to optimize this kind of data processing. I'm using Windows 10, and I need to parse many relatively small .txt files and rearrange their content in a table, where table index is a pair of 2D grid coordinates and table value is a sequence of strings. The orde