On 08/06/2016 03:28 AM, Kevin O'Gorman wrote:
On Fri, Aug 5, 2016 at 1:08 PM, David Raymond <david.raym...@tomtom.com>
wrote:

There's a deprecated pragma, PRAGMA temp_store_directory =
'directory-name'; that appears to still work OK, for now at least.
http://www.sqlite.org/pragma.html#pragma_temp_store_directory

I've used it for the same space reasons you've both mentioned, though in
my case it was vacuuming a huge db, and the full-size db copy that VACUUM
makes ate up my entire C drive. With that pragma you can specify exactly
which folder you want used for your temp files. I'd suggest trying it,
then monitoring the folder you give it to see what shows up.
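
Something like this, for instance (a sketch; huge.db and /data/tmp are
made-up names):

  sqlite3 huge.db "PRAGMA temp_store_directory = '/data/tmp'; VACUUM;"
  # from another terminal, watch the temp copy appear and grow:
  watch -n 5 ls -lh /data/tmp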

Since it's deprecated, I'll stick with TMPDIR, because it's pretty standard
on Linux (it also works with the sort utility, for instance).
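
E.g. (a sketch; the database, index, and table names are made up):

  TMPDIR=/data/tmp sqlite3 huge.db 'CREATE INDEX idx_pos ON positions(pos);'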


Apart from the default location of the files, it sounds like your next main
concern is how many temp files get opened. My bet is that it'll be a very
small number, just potentially huge in file size while it's doing its
thing. But again, try that pragma and take a look.

My bet is the contrary: it starts with small files and makes progressively
larger ones, like the sort utility does. The problem is that there are too
many of them at the beginning for it to work with anonymous files (which
sort does not use). This at least offers a possible explanation of its
getting wedged on large indexes: an unexpected and untested error, handled
poorly.

You could verify this by checking the number of open handles in "/proc/<pid>/fd" after your process is wedged.
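
For instance (12345 standing in for the wedged pid):

  ls /proc/12345/fd | wc -l                 # current count of open descriptors
  watch -n 10 'ls /proc/12345/fd | wc -l'   # watch it climb during the build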

<Insert normal disclaimers about suggested use of a deprecated feature>
-----Original Message-----
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org]
On Behalf Of Kevin O'Gorman
Sent: Friday, August 05, 2016 3:41 PM
To: SQLite mailing list
Subject: Re: [sqlite] newbie has waited days for a DB build to complete. what's up with this.

On Fri, Aug 5, 2016 at 12:30 PM, Igor Korot <ikoro...@gmail.com> wrote:

Hi, Kevin,

On Fri, Aug 5, 2016 at 3:18 PM, Kevin O'Gorman <kevinogorm...@gmail.com>
wrote:
Okay, I followed some of the advice y'all gave and got some results.

1. The original problem was compromised by malformed input. However, it
appears that did not cause the wedging of the process. See (3) below.
Where will the data come from? From the user? The Internet?

What I'm getting at is: you need to watch for malformed data in the
future as well.

I generate it.  I goofed, and I'll try not to goof in the future.


2. I separated the steps, and started small. Time increased slightly
sub-linearly with dataset size, so I jumped to doing the whole thing. With
proper input, the data was loaded in 68 minutes.

3. The CREATE INDEX steps failed quickly (2 minutes), reporting "database
or disk is full", which seemed odd since most of my partitions have much
more free space than the entire database. It turns out that whatever does
the creation was using space on my root partition (this is Linux, so that
means "/"). That's the only partition in my setup without a huge amount of
free space. One would expect temporary stuff to go to /tmp (which has 3 TB
free), but it doesn't go there. It would go there if the system's native
"sort" program were used. Fortunately, it turns out that the TMPDIR
environment variable is honored, but while I could see space being used,
there were no files visible. I take that to mean that the tmpfile()
function (or equivalent) was used. This could be a bad idea for large
indexes, because anonymous files have to be kept open, and there's a limit
on the number of files that can be open at a time, around 1,000. Sure
enough, the index creation appears to be wedged like the original run, and
after a few hours I killed it manually. This is a deal-killer.
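
(The limit is easy to check, by the way; the default soft limit is
commonly 1024 on Linux. A sketch, with 12345 standing in for the pid:

  ulimit -n                              # soft limit for this shell
  ulimit -n 65536                        # raise it, if the hard limit allows
  grep 'open files' /proc/12345/limits   # limits of a running process
)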
The failure you saw - is it on the table with the complete data set, or
did you get it during the experimenting?

Only on the complete data set.

So the questions are: Where do bug reports go? I seem to be running
3.8.2; is this fixed in any later version?
You can try the "3.14" pre-release one right now. ;-)

Meh.  I submitted a bug report to this list.  I'll see what happens.


Thank you.


On Thu, Aug 4, 2016 at 9:27 AM, Kevin O'Gorman <kevinogorm...@gmail.com>
wrote:

The metric for feasibility is coding ease, not runtime. I'm the
bottleneck, not the machine, at least at this point.

As for adding rows, it will be about like this time: a billion or so at a
time. But there's no need to save the old data. Each round can be
separate, except for a persistent "solutions" table of much more modest
size. I've been doing this for a while now, and the solutions file has
only 10 million or so lines, each representing a game position for which
optimum moves are known. Getting this file to include the starting
position is the point of the exercise.

If I ever get to anything like "production" in this project, I expect it
to run for maybe three years... That's after I tweak it for speed.

Background: in production, this will be running on a dual-Xeon with 16
cores (32 hyperthreads) and 1/4 TiB RAM. It has sequential file update
through Linux flock() calls at the moment. The code is bash gluing
together a collection of UNIX utilities and some custom C code. The C is
kept as simple as possible, to minimize errors.
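
The flock() part is just the usual shell idiom, roughly (a sketch, with
invented file names):

  (
    flock -x 9                        # take an exclusive lock on fd 9
    cat new_batch >> solutions.txt    # critical section: append under lock
  ) 9>> solutions.lock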

As you may surmise, this "hobby" is important to me.


On Thu, Aug 4, 2016 at 9:09 AM, R Smith <rsm...@rsweb.co.za> wrote:


On 2016/08/04 5:56 PM, Kevin O'Gorman wrote:

On Thu, Aug 4, 2016 at 8:29 AM, Dominique Devienne <ddevie...@gmail.com>
wrote:


It's even less dense than that. Each character has only 3 possible
values, and thus it's pretty easy to compress down to 2 bits each, for a
16-byte blob. It's just hard to do that without a bunch of SQLite code I'd
have to learn how to write. The current effort amounts to a feasibility
study, and I want to keep it as simple as possible.
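
The arithmetic, at least, is simple: at 2 bits per character, a 16-byte
blob implies 64 characters. A hypothetical pack helper, in shell (assumes
bash and xxd, and positions given as 64-character strings over {0,1,2}):

  pack() {
    local s=$1 hex='' i j b
    for ((i = 0; i < 64; i += 4)); do
      b=0
      for ((j = 0; j < 4; j++)); do
        b=$(( (b << 2) | ${s:i+j:1} ))    # two bits per character
      done
      printf -v hex '%s%02x' "$hex" "$b"  # accumulate one byte as hex text
    done
    printf '%s' "$hex" | xxd -r -p        # hex -> 16 raw bytes on stdout
  }
  pack "$(printf '1%.0s' {1..64})" | wc -c   # sanity check: prints 16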

A feasibility study using equipment that is hamstrung by weights it won't
have in the real situation is not an accurate study.

It's like studying fuel consumption on a different kind of road surface,
but where, for the test's purposes, the cars had to tow caravans
containing their testing equipment - the study will not look feasible at
all.

It might of course be that the feasibility you are studying is completely
unrelated to the data handling - in which case the point is moot.

Let us know how it goes :)
Ryan










--
#define QUESTION ((bb) || (!bb)) /* Shakespeare */




_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
