On Sat, Aug 6, 2016 at 2:49 PM, Kevin O'Gorman <kevinogorm...@gmail.com>
wrote:

> On Sat, Aug 6, 2016 at 2:09 AM, Dan Kennedy <danielk1...@gmail.com> wrote:
>
>> On 08/06/2016 09:52 AM, Kevin O'Gorman wrote:
>>
>>> On Fri, Aug 5, 2016 at 2:03 PM, Dan Kennedy <danielk1...@gmail.com>
>>> wrote:
>>>
>>> On 08/06/2016 03:28 AM, Kevin O'Gorman wrote:
>>>>
>>>> On Fri, Aug 5, 2016 at 1:08 PM, David Raymond <david.raym...@tomtom.com
>>>>> >
>>>>> wrote:
>>>>>
>>>>> ......
>>>>>
>>>>>> Apart from the default location of the files, it reads like your next
>>>>>> main concern is how many temp files get opened up. My bet is that it'll
>>>>>> be a very small number, just potentially huge in file size while it's
>>>>>> doing its thing. But again, try that pragma and take a look.
>>>>>
>>>>> My best bet is the contrary: it starts with small files and makes
>>>>> increasingly larger ones, like the sort utility does.  The problem is
>>>>> that there are too many of them at the beginning for it to work with
>>>>> anonymous files (which sort does not use).  This at least offers a
>>>>> possible explanation of its getting wedged on large indexes: an
>>>>> unexpected and untested error, handled poorly.
>>>>>
>>>>> You could verify this by checking the number of open handles in
>>>> "/proc/<pid>/fd" after your process is wedged.
>>>>
>>> Excellent idea.  I did not know about that possibility.  And sure enough,
>>> I'm wrong.  It's using anonymous files, all right, but only one or two
>>> at a time.  I assume they're big.  I'm in the process of bracketing where
>>> size begins to matter.  So far, 1/10 of the data loads and indexes just
>>> fine, albeit somewhat more slowly than the smaller samples predicted.  The
>>> database load took 6.5 minutes, the troublesome index 10 minutes.  At
>>> smaller sizes, indexing is faster than the database load.
>>>
>>> I'm trying 1/3 now (500 million lines).
>>>
>>
>> What does [top] tell you once the process becomes wedged? What percentage
>> is the CPU running at? Or is it completely bogged down waiting for IO?
>>
>> Dan
>
>
> I'm waiting for a good time to get the answer to this.  It takes a good
> long while to get wedged, so I'll probably do it overnight tonight.
>

RETRACTION: it doesn't get wedged after all; it just takes about 4 times
longer than I expected.  On small inputs, it builds the indexes faster than
it loads the data table.  When I test on the whole data file, each index
takes 4 times longer to build than the table took to load.  I guess that's
what it looks like when building the table is O(n) (linear) while building
an index is O(n * log(n)), and log(n) keeps getting bigger.
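
(Just to put rough numbers on that log(n) term -- this is only a
back-of-the-envelope sketch, and the row counts below are my approximations
of the 1/10 sample, the 1/3 sample, and the whole file, not exact figures:)

    import math

    # Approximate row counts: 1/10 sample, 1/3 sample (the 500 million lines
    # mentioned above), and the whole file (roughly 1.5 billion lines).
    for rows in (150_000_000, 500_000_000, 1_500_000_000):
        # Table load scales roughly like n; index build like n * log2(n),
        # so the extra factor relative to the load is about log2(n).
        print(f"{rows:>13,d} rows: log2(n) ~ {math.log2(rows):.1f}")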

Sorry for the kerfuffle.  This is my first time working with datasets this
big.  Since it takes almost 6 hours to build each index (there are 2 at
this point), and I didn't know to expect that, I drew the wrong
conclusion.  My bad.

I'm just glad Mr. Hipp got into this thread and advised building the
indexes after the data is loaded.  It would have been _lots_ slower to
update the indexes on every insert, and in fact my first attempt -- the one
that gave me the initial idea that this had wedged -- took that approach,
and I decided it was wedged after a few _days_.  I suspect it wasn't really
wedged either, but it might as well have been.
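
For anyone who finds this thread later, the overall shape of the load is
roughly the following.  This is only a sketch: the database file, table,
and column names are made up for illustration (my real schema and input
are different), but the ordering is the point.

    import sqlite3

    con = sqlite3.connect("positions.db")                        # made-up file name
    con.execute("CREATE TABLE pos (board TEXT, score INTEGER)")  # made-up schema

    # Load all the rows first, inside one transaction, with no indexes
    # defined yet, so every insert is just an append to the table.
    with con:
        con.executemany(
            "INSERT INTO pos VALUES (?, ?)",
            ((line.rstrip("\n"), 0) for line in open("data.txt")),  # placeholder input
        )

    # Only after the data is in, build each index in a single pass.
    with con:
        con.execute("CREATE INDEX pos_board ON pos(board)")
        con.execute("CREATE INDEX pos_score ON pos(score)")

    con.close()

The key point, per the advice in this thread, is simply that the CREATE
INDEX statements come after the bulk insert, so each index gets sorted and
built once rather than being updated a row at a time during the load.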

So far, this has been a feasibility exercise.  Building a 500 GB database
overnight is definitely feasible.  Now I can go on to see if it helps me
solve my problem.
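
And for the record, the handle check Dan suggested was as simple as looking
at /proc/<pid>/fd while the load was running.  Here is a minimal sketch of
the same thing from Python (the pid is whatever the loading process happens
to be on your machine):

    import os

    pid = 12345  # substitute the pid of the process doing the load
    fd_dir = f"/proc/{pid}/fd"
    for fd in sorted(os.listdir(fd_dir), key=int):
        # Each entry is a symlink to the open file; temp files that have
        # already been unlinked (the "anonymous" files discussed above)
        # show a target ending in " (deleted)".
        print(fd, "->", os.readlink(os.path.join(fd_dir, fd)))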


-- 
#define QUESTION ((bb) || (!bb)) /* Shakespeare */
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
