On Fri, 29 Jan 2016 01:34:01 +0200
R Smith <rsmith at rsweb.co.za> wrote:

> 
> I think the dictionary would be faster for this use-case (mostly because 
> it runs in-memory and we do not require all the DB data guarantees for 
> the initial sort). However, the fastest way in SQL would be to use a 
> temporary table; do you have any problem with that? It would mean doing 
> the inserts in an insert loop and then simply calculating the incidence 
> values and inserting into the real table, so it would be a two-step 
> process.
> 
> In fact, let me quickly write out the SQL here. The table create is just 
> like before, and then:
> 
> BEGIN TRANSACTION;
> 
>    -- Make Temporary table
> CREATE TEMPORARY TABLE eav (
>    e TEXT COLLATE NOCASE,
>    a TEXT COLLATE NOCASE,
>    v TEXT COLLATE NOCASE
> );
> 
>    -- The insert loop:
> INSERT INTO eav VALUES (:element, :attribute, :value);
>    -- end of Insert loop
> 
> 
> CREATE INDEX Idx_eav ON eav(e,a,v);
> 
> INSERT OR IGNORE INTO element_attribute_values (element,attribute,value,cnt)
>    SELECT e,a,v,COUNT(*)
>      FROM eav
>     GROUP BY e,a,v;
> 
>    -- Cleanup
> DROP TABLE eav;
> 
> COMMIT;
> 
> 
> If that is not a LOT faster then I will be very surprised...

I tried this, using a single transaction for the whole process, and measured 
15 seconds instead of 22. Given that the program takes 6 seconds without any 
insertions, that's good.
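
For the record, driven from Python's sqlite3 module the two-step approach 
looks roughly like this (a sketch only: the connection setup and the 
`triples` iterator stand in for what the program actually does):

import sqlite3

def load_counts(db_path, triples):
    # isolation_level=None means autocommit; the transaction is managed by hand
    conn = sqlite3.connect(db_path, isolation_level=None)
    conn.execute("BEGIN")
    conn.execute(
        "CREATE TEMPORARY TABLE eav ("
        " e TEXT COLLATE NOCASE,"
        " a TEXT COLLATE NOCASE,"
        " v TEXT COLLATE NOCASE)")
    # The insert loop: one row per occurrence, duplicates included
    conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", triples)
    conn.execute("CREATE INDEX Idx_eav ON eav(e, a, v)")
    # Collapse duplicates into counted rows in the real table
    conn.execute(
        "INSERT OR IGNORE INTO element_attribute_values"
        " (element, attribute, value, cnt)"
        " SELECT e, a, v, COUNT(*) FROM eav GROUP BY e, a, v")
    conn.execute("DROP TABLE eav")
    conn.execute("COMMIT")
    conn.close()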

I finally used another way: a Python counter dictionary (the `Counter` class 
in the `collections` module) computes the counts on a per-file basis, and 
then there is one insert-or-ignore-then-update request per triplet, which 
adds the per-file count to "cnt" instead of incrementing it by just one. In 
other words, there is only one insert/update per distinct triplet in a given 
file. Instead of 103K requests to produce 15K rows, there are now 23K 
requests to produce the same 15K rows.
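
In outline, it is something like this (a sketch, with invented names; the 
real program does this inside its per-file loop):

from collections import Counter

def store_file_counts(conn, file_triples):
    # Count each (element, attribute, value) triplet within one file
    counts = Counter(file_triples)
    for (e, a, v), n in counts.items():
        # Create the row with cnt = 0 if it does not exist yet...
        conn.execute(
            "INSERT OR IGNORE INTO element_attribute_values"
            " (element, attribute, value, cnt) VALUES (?, ?, ?, 0)",
            (e, a, v))
        # ...then add the whole per-file count in one update
        conn.execute(
            "UPDATE element_attribute_values SET cnt = cnt + ?"
            " WHERE element = ? AND attribute = ? AND value = ?",
            (n, e, a, v))
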
With this scheme and a transaction per file (as the program always did), I 
measured 17 seconds. So I arranged to group transactions transparently: a 
transaction is really committed only every N files (with any remainder 
committed at the end), and with N = 25 I measured 11 seconds. I believe 
that's good enough (and I guess it is wise to avoid overly big transactions 
anyway).
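
The grouping itself is a small wrapper, along these lines (again a sketch; 
the class and method names are invented for the example):

class GroupedCommit:
    """Commit the connection only every n "file done" notifications."""
    def __init__(self, conn, n=25):
        self.conn = conn
        self.n = n
        self.pending = 0

    def file_done(self):
        # Called once per processed file; commits every n files
        self.pending += 1
        if self.pending >= self.n:
            self.conn.commit()
            self.pending = 0

    def finish(self):
        # Take care of any remainder at the end of the run
        if self.pending:
            self.conn.commit()
            self.pending = 0

Calling file_done() after each file and finish() at the end keeps the commit 
logic out of the parsing code.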


-- 
Yannick Duchêne
