On Fri, 29 Jan 2016 01:34:01 +0200 R Smith <rsmith at rsweb.co.za> wrote:
> I think the dictionary would be faster for this use-case (mostly cause
> it runs in-memory and we do not require all the DB data guarantees for
> the initial sort). However, the fastest way in SQL would be to use a
> temporary table, do you have any problems with that? It would mean you
> would do the inserts in an insert loop and then simply calculate the
> incidence values and insert to the real table - so it would require a
> 2-step process.
>
> In fact, let me write the SQL quick here. The table create is just like
> before, and then:
>
> BEGIN TRANSACTION;
>
> -- Make Temporary table
> CREATE TEMPORARY TABLE eav (
>   e TEXT COLLATE NOCASE,
>   a TEXT COLLATE NOCASE,
>   v TEXT COLLATE NOCASE
> );
>
> -- The insert loop:
> INSERT INTO eav VALUES (:element, :attribute, :value);
> -- end of Insert loop
>
> CREATE INDEX Idx_eav ON eav(e,a,v);
>
> INSERT OR IGNORE INTO element_attribute_values (element,attribute,value,cnt)
>   SELECT e,a,v,COUNT(*)
>     FROM eav
>    WHERE 1
>    GROUP BY e,a,v;
>
> -- Cleanup
> DROP TABLE eav;
>
> COMMIT;
>
> If that is not a LOT faster then I will be very surprised...

I tried this, using a single transaction for the whole run, and measured 15 seconds instead of 22. Given that the program takes 6 seconds without any insertions at all, that's good.

I finally settled on another way: I use a Python counter dictionary (the `Counter` class in the `collections` module) to compute the counts on a per-file basis, then issue one insert-or-ignore-then-update request per triplet, which adds the file's count to "cnt" instead of incrementing it by just one. In fewer words, there is only one insert/update per triplet for a given file. Instead of 103K requests producing 15K rows, there are now 23K requests producing the same 15K rows. Using this with a transaction per file (as the program always did), I measured 17 seconds.
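For the record, a minimal sketch of that Counter-plus-upsert idea (the table and column names come from the SQL quoted above; the function name and connection setup are mine, not the actual program):

```python
import sqlite3
from collections import Counter

def insert_file_counts(conn, triplets):
    """Count (element, attribute, value) triplets for one file in memory,
    then issue one INSERT OR IGNORE plus one UPDATE per distinct triplet,
    adding the per-file count to cnt in a single transaction."""
    counts = Counter(triplets)
    with conn:  # one transaction per file, committed on exit
        for (e, a, v), n in counts.items():
            conn.execute(
                "INSERT OR IGNORE INTO element_attribute_values"
                " (element, attribute, value, cnt) VALUES (?, ?, ?, 0)",
                (e, a, v))
            conn.execute(
                "UPDATE element_attribute_values SET cnt = cnt + ?"
                " WHERE element = ? AND attribute = ? AND value = ?",
                (n, e, a, v))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE element_attribute_values ("
    " element TEXT COLLATE NOCASE, attribute TEXT COLLATE NOCASE,"
    " value TEXT COLLATE NOCASE, cnt INTEGER,"
    " PRIMARY KEY (element, attribute, value))")

# Two occurrences of one triplet and one of another: only two
# insert/update pairs are issued, not three.
insert_file_counts(conn, [("p", "class", "x"),
                          ("p", "class", "x"),
                          ("a", "href", "y")])
```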
So I devised a way to group transactions transparently: a transaction is really committed only every N times (taking care of any remainder at the end), and using 25 for N, I measure 11 seconds. I believe that's good enough (and I guess it is also good to avoid very big transactions).

--
Yannick Duchêne
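A sketch of what such transparent grouping might look like (the class and method names are mine; the original program's structure is not shown in this thread):

```python
import sqlite3

class GroupedTransactions:
    """Batch N logical transactions into one real COMMIT.
    end_transaction() is called wherever the program used to commit;
    flush() handles the remainder when fewer than N are pending."""

    def __init__(self, conn, group_size=25):
        self.conn = conn
        self.group_size = group_size
        self.pending = 0

    def end_transaction(self):
        self.pending += 1
        if self.pending >= self.group_size:
            self.conn.commit()
            self.pending = 0

    def flush(self):
        # Commit any remainder (fewer than group_size pending).
        if self.pending:
            self.conn.commit()
            self.pending = 0

# Usage sketch: commit really happens only every third call here.
conn = sqlite3.connect(":memory:")
gt = GroupedTransactions(conn, group_size=3)
```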