On Sat, Aug 6, 2016 at 2:49 PM, Kevin O'Gorman <kevinogorm...@gmail.com> wrote:
> On Sat, Aug 6, 2016 at 2:09 AM, Dan Kennedy <danielk1...@gmail.com> wrote:
>
>> On 08/06/2016 09:52 AM, Kevin O'Gorman wrote:
>>
>>> On Fri, Aug 5, 2016 at 2:03 PM, Dan Kennedy <danielk1...@gmail.com> wrote:
>>>
>>>> On 08/06/2016 03:28 AM, Kevin O'Gorman wrote:
>>>>
>>>>> On Fri, Aug 5, 2016 at 1:08 PM, David Raymond <david.raym...@tomtom.com> wrote:
>>>>>
>>>>>> ......
>>>>>>
>>>>>> Apart from the default location of the files, it reads like your next
>>>>>> main concern is how many temp files get opened up. My bet is that it'll
>>>>>> be a very small number, just potentially huge in file size while it's
>>>>>> doing its thing. But again, try that pragma and take a look.
>>>>>
>>>>> My best bet is the contrary: it starts with small files and makes
>>>>> increasingly larger ones, like the sort utility does. The problem is
>>>>> that there are too many of them at the beginning for it to work with
>>>>> anonymous files (which sort does not use). This at least offers a
>>>>> possible explanation of its getting wedged on large indexes: an
>>>>> unexpected and untested error, handled poorly.
>>>>
>>>> You could verify this by checking the number of open handles in
>>>> "/proc/<pid>/fd" after your process is wedged.
>>>
>>> Excellent idea. I did not know about that possibility. And sure enough,
>>> I'm wrong. It's using anonymous files, all right, but only one or two
>>> at a time. I assume they're big. I'm in the process of bracketing where
>>> size begins to matter. So far, 1/10 of the data loads and indexes just
>>> fine, albeit somewhat more slowly than the smaller samples predicted.
>>> The database load took 6.5 minutes, the troublesome index 10 minutes.
>>> At smaller sizes, indexing is faster than the database load.
>>>
>>> I'm trying 1/3 now (500 million lines).
>>
>> What does [top] tell you once the process becomes wedged? What percentage
>> is the CPU running at?
>> Or is it completely bogged down waiting for IO?
>>
>> Dan
>
> I'm waiting for a good time to get the answer to this. It takes a good
> long while to get to wedged, so I'll probably do it overnight tonight.

RETRACTION: it doesn't get wedged after all; it just takes about 4 times
longer than I expected. On small inputs, it builds the indexes faster than
the data table. When I test on the whole data file, each index takes 4
times longer to build than the table did. I guess that's what it looks like
when building the table is O(n) complexity (linear) and building the index
is O(n * log(n)), with log(n) getting bigger.

Sorry for the kerfuffle. This is my first time working with datasets this
big. Since it takes almost 6 hours to build each index (there are 2 at this
point), and I didn't know to expect that, I drew the wrong conclusion. My
bad.

I'm just glad Mr. Hipp got into this thread and advised building the
indexes after the data is loaded. It would have been _lots_ slower to
maintain each index row-by-row during the inserts, and in fact my first
attempt -- the one that gave me the initial idea that this had wedged --
took that approach, and I decided it was wedged after a few _days_. I
suspect it wasn't wedged either, but it might as well have been.

So far, this has been a feasibility exercise. Building a 500 GB database
overnight is definitely feasible. Now I can go on to see if it helps me
solve my problem.

-- 
#define QUESTION ((bb) || (!bb)) /* Shakespeare */
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
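[Editor's note] The load-first, index-later pattern discussed above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the poster's actual schema: the table name, column names, and index names below are made up, the database is in-memory, and a real 500 GB run would use a file database and far larger batches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real run would open a file database
cur = conn.cursor()

# 1. Create the bare table (no indexes) and bulk-load the rows first.
#    Appending to an unindexed table is roughly O(n).
cur.execute("CREATE TABLE positions (board TEXT, score INTEGER)")
cur.executemany(
    "INSERT INTO positions VALUES (?, ?)",
    (("pos%d" % i, i % 3) for i in range(1000)),
)

# 2. Only then build the indexes. Each CREATE INDEX is one O(n log n)
#    sort over the loaded data, instead of n separate B-tree insertions
#    interleaved with the load -- which is why indexing after the load
#    is so much faster for huge datasets.
cur.execute("CREATE INDEX idx_board ON positions(board)")
cur.execute("CREATE INDEX idx_score ON positions(score)")
conn.commit()
```

The same two-phase shape applies in the sqlite3 shell: run all the .import / INSERT work first, then issue the CREATE INDEX statements.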