On 08/06/2016 03:28 AM, Kevin O'Gorman wrote:
On Fri, Aug 5, 2016 at 1:08 PM, David Raymond <david.raym...@tomtom.com>
wrote:

There's a deprecated pragma, PRAGMA temp_store_directory =
'directory-name'; that appears to still work OK, for now at least.
http://www.sqlite.org/pragma.html#pragma_temp_store_directory

I've used it for the same space reasons you've both mentioned, though in
my case it was vacuuming a huge db, and the full-size db copy that VACUUM
makes ate up my entire C drive. With that pragma you can specify exactly
which folder you want used for your temp files. I'd suggest trying it,
then monitoring the folder you give it to see what shows up.
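
Something like this, for instance (a sketch; huge.db and /data/tmp are
made-up names):

  sqlite3 huge.db "PRAGMA temp_store_directory = '/data/tmp'; VACUUM;"
  # from another terminal, watch the temp copy appear and grow:
  watch -n 5 ls -lh /data/tmp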

Since it's deprecated, I'll stick with TMPDIR, because it's pretty standard
on Linux (it also works with the sort utility, for instance).
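
E.g. (a sketch; the database, index, and table names are made up):

  TMPDIR=/data/tmp sqlite3 huge.db 'CREATE INDEX idx_pos ON positions(pos);'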


Apart from the default location of the files, it sounds like your next main
concern is how many temp files get opened. My bet is that it'll be a very
small number, just potentially huge in file size while it's doing its
thing. But again, try that pragma and take a look.

My bet is the contrary: it starts with small files and makes progressively
larger ones, like the sort utility does. The problem is that there are too
many of them at the beginning for it to work with anonymous files (which
sort does not use). This at least offers a possible explanation of its
getting wedged on large indexes: an unexpected and untested error, handled
poorly.

You could verify this by checking the number of open handles in "/proc/<pid>/fd" after your process is wedged.
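
For instance (12345 standing in for the wedged pid):

  ls /proc/12345/fd | wc -l                 # current count of open descriptors
  watch -n 10 'ls /proc/12345/fd | wc -l'   # watch it climb during the build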

<Insert normal disclaimers about suggested use of a deprecated feature>
-----Original Message-----
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org]
On Behalf Of Kevin O'Gorman
Sent: Friday, August 05, 2016 3:41 PM
To: SQLite mailing list
Subject: Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

On Fri, Aug 5, 2016 at 12:30 PM, Igor Korot <ikoro...@gmail.com> wrote:

Hi, Kevin,

On Fri, Aug 5, 2016 at 3:18 PM, Kevin O'Gorman <kevinogorm...@gmail.com>
wrote:
Okay, I followed some of the advice y'all gave and got some results.

1. The original problem was compromised by malformed input. However, it
appears that did not cause the wedging of the process. See (3) below.
Where will the data come from? From the user? The Internet?

What I'm getting at is: you need to watch for malformed data in the
future as well.

I generate it.  I goofed, and I'll try not to goof in the future.


2. I separated the steps, and started small. Time increased slightly
sub-linearly with dataset size, so I jumped to doing the whole thing. With
proper input, the data was loaded in 68 minutes.

3. The CREATE INDEX steps failed quickly (2 minutes), reporting "database
or disk is full", which seemed odd since most of my partitions have much
more free space than the entire database. It turns out that whatever does
the creation was using space on my root partition (this is Linux, so that
means "/"). That's the only partition in my setup without a huge amount of
free space. One would expect temporary stuff to go to /tmp (which has 3 TB
free), but it doesn't go there. It would go there if the system's native
"sort" program were used. Fortunately, it turns out that the TMPDIR
environment variable is honored, but while I could see space being used,
there were no files visible. I take that to mean that the tmpfile()
function (or equivalent) was used. This could be a bad idea for large
indexes, because anonymous files have to be kept open, and there's a limit
on the number of files that can be open at a time, around 1,000. Sure
enough, the index creation appears to be wedged like the original run, and
after a few hours I killed it manually. This is a deal-killer.
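
(The limit is easy to check, by the way; the default soft limit is
commonly 1024 on Linux. A sketch, with 12345 standing in for the pid:

  ulimit -n                              # soft limit for this shell
  ulimit -n 65536                        # raise it, if the hard limit allows
  grep 'open files' /proc/12345/limits   # limits of a running process
)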
The failure you saw - is it on the table with the complete data set, or
did you get it during the experimenting?

Only on the complete data set.

So the questions are: Where do bug reports go? I seem to be running
3.8.2; is this fixed in any later version?
You can try the "3.14" pre-release one right now. ;-)

Meh.  I submitted a bug report to this list.  I'll see what happens.


Thank you.


On Thu, Aug 4, 2016 at 9:27 AM, Kevin O'Gorman <kevinogorm...@gmail.com>
wrote:

The metric for feasibility is coding ease, not runtime. I'm the
bottleneck, not the machine, at least at this point.

As for adding rows, it will be about like this time: a billion or so at a
time. But there's no need to save the old data. Each round can be
separate, except for a persistent "solutions" table of much more modest
size. I've been doing this for a while now, and the solutions file has
only 10 million or so lines, each representing a game position for which
optimum moves are known. Getting this file to include the starting
position is the point of the exercise.

If I ever get to anything like "production" in this project, I expect it
to run for maybe three years... That's after I tweak it for speed.

Background: in production, this will be running on a dual-Xeon with 16
cores (32 hyperthreads) and 1/4 TiB RAM. It has sequential file update
through Linux flock() calls at the moment. The code is bash gluing
together a collection of UNIX utilities and some custom C code. The C is
kept as simple as possible, to minimize errors.
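
The flock() part is just the usual shell idiom, roughly (a sketch, with
invented file names):

  (
    flock -x 9                        # take an exclusive lock on fd 9
    cat new_batch >> solutions.txt    # critical section: append under lock
  ) 9>> solutions.lock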

As you may surmise, this "hobby" is important to me.


On Thu, Aug 4, 2016 at 9:09 AM, R Smith <rsm...@rsweb.co.za> wrote:


On 2016/08/04 5:56 PM, Kevin O'Gorman wrote:

On Thu, Aug 4, 2016 at 8:29 AM, Dominique Devienne <ddevie...@gmail.com>
wrote:


It's even less dense than that. Each character has only 3 possible
values, and thus it's pretty easy to compress down to 2 bits each, for a
16-byte blob. It's just hard to do that without a bunch of SQLite code I'd
have to learn how to write. The current effort amounts to a feasibility
study, and I want to keep it as simple as possible.
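
The arithmetic, at least, is simple: at 2 bits per character, a 16-byte
blob implies 64 characters. A hypothetical pack helper, in shell (assumes
bash and xxd, and positions given as 64-character strings over {0,1,2}):

  pack() {
    local s=$1 hex='' i j b
    for ((i = 0; i < 64; i += 4)); do
      b=0
      for ((j = 0; j < 4; j++)); do
        b=$(( (b << 2) | ${s:i+j:1} ))    # two bits per character
      done
      printf -v hex '%s%02x' "$hex" "$b"  # accumulate one byte as hex text
    done
    printf '%s' "$hex" | xxd -r -p        # hex -> 16 raw bytes on stdout
  }
  pack "$(printf '1%.0s' {1..64})" | wc -c   # sanity check: prints 16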

A feasibility study using equipment that is hamstrung by weights it won't
have in the real situation is not an accurate study.

It's like studying fuel consumption on a different kind of road surface,
but where, for the test's purposes, the cars had to tow caravans
containing their testing equipment - the study will not look feasible at
all.

It might of course be that the feasibility you are studying is completely
unrelated to the data handling - in which case the point is moot.

Let us know how it goes :)
Ryan










--
#define QUESTION ((bb) || (!bb)) /* Shakespeare */




_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
