[sqlite] Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

2016-02-01 Thread Yannick Duchêne
On Thu, 28 Jan 2016 22:08:02 +0200
R Smith  wrote:

> I think you are misunderstanding the Pragma and the idea of automatic 
> indices. An automatic Index might be created on a table that doesn't 
> have an adequately assigned primary key. It might also be created during 
> a query (mostly SELECT or sub-SELECT) for which there is no useful Index 
> created by the table designer and the Query planner figures it will be 
> quicker to make an Index than to do table scans through the query. This 
> habit of creating indices during select queries can be forced to not 
> happen by setting the "PRAGMA automatic_index=0;", but this needs to 
> happen when you open the DB connection, or at a minimum, before you try 
> any query - not after the cursor is created,  by that time the index 
> might already be made. (This is why you are not seeing any speed 
> improvement).

Here is, from the documentation: https://www.sqlite.org/optoverview.html
At the very bottom, is said:
> Do not confuse automatic indexes with the internal indexes (having names like
> "sqlite_autoindex_table_N") that are sometimes created to implement a PRIMARY
> KEY constraint or UNIQUE constraint. The automatic indexes described here
> exist only for the duration of a single query, are never persisted to disk,
> and are only visible to a single database connection. Internal indexes are
> part of the implementation of PRIMARY KEY and UNIQUE constraints, are
> long-lasting and persisted to disk, and are visible to all database
> connections. The term "autoindex" appears in the names of internal indexes
> for legacy reasons and does not indicate that internal indexes and automatic
> indexes are related.

That's this legacy naming which causes confusion.


-- 
Yannick Duch?ne


[sqlite] Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

2016-01-29 Thread Yannick Duchêne
On Thu, 28 Jan 2016 19:18:06 -0500
"Keith Medcalf"  wrote:

> 
> When you create a table thusly:
> 
> create table x (x primary key, y, z);
> 
> you are creating a rowid table with columns x, y, an z.  You are also saying 
> that you want x to be the primary key.  Therefore, you will get a table 
> called x containing (ROWID, X, Y, Z).  This is a btree where the index is the 
> rowid.  In order to implement you PRIMARY KEY, an index must be constructed 
> containing the column X and ROWID in the table where the value of X is found. 
>  This index will be called sqlite_autoindex_x_1 because it is the first index 
> created automatically on table x.

Close to what I was suspecting. Thanks for these points.

> You could achieve the same result as follows:
> 
> create table x (x, y, z);
> create unique index pk_x_x on x(x);
> 
> except that now you have given the index on x an explicit name and one does 
> not have to be manufactured for you.
> 

I checked these indexes gets created even if I manually define similar indexes. 
But I don't mind. After some experiments, this seems more related to ROWID, and 
you seem to confirm a relation, above.


-- 
Yannick Duch?ne


[sqlite] Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

2016-01-29 Thread Yannick Duchêne
On Fri, 29 Jan 2016 01:34:01 +0200
R Smith  wrote:

> 
> I think the dictionary would be faster for this use-case (mostly cause 
> it runs in-memory and we do not require all the DB data guarantees for 
> the initial sort). However, the fastest way in SQL would be to use a 
> temporary table, do you have any problems with that? It would mean you 
> would do the inserts in an insert loop and then simply calculate the 
> incidence values and insert to the real table - so it would require a 2 
> - step process.
> 
> In fact, let me write the SQL quick here. The table create is just like 
> before, and then:
> 
> BEGIN TRANSACTION;
> 
>-- Make Temporary table
> CREATE TEMPORARY TABLE eav (
>e TEXT COLLATE NOCASE,
>a TEXT COLLATE NOCASE,
>v TEXT COLLATE NOCASE
> );
> 
>-- The insert loop:
> INSERT INTO eav VALUES (:element, :attribute, :value);
>-- end of Insert loop
> 
> 
> CREATE INDEX Idx_eav ON eav(e,a,v);
> 
> INSERT OR IGNORE INTO element_attribute_values (element,attribute,value,cnt)
>SELECT e,a,v,COUNT(*)
>  FROM EAV
> WHERE 1
> GROUP BY e,a,v;
> 
>-- Cleanup
> DROP TABLE eav;
> 
> COMMIT;
> 
> 
> If that is not a LOT faster then I will be very surprised...

I tried this, using a single transaction for the whole and I measured 15 
seconds instead of 22. Given that the program takes 6 seconds without 
insertions, that's good.

I finally use another way: I use a Python counter dictionary (the `Counter` 
class in the `collections` module) to compute the counts on a per file basis, 
then there is an insert?or?ignore?then?update request for each triplet, which 
add to "cnt" instead of incrementing it by just one. In fewer words, there is 
only one insert/update per triplet for a given file. Instead of 103K requests 
to give 15K rows, there is now 23K requests to give the same 15K rows. Using 
this with a transaction per file (as the program always did), I measured 17 
seconds. So I devised to group transactions transparently, a transaction is 
really committed only each N times (it takes care of any reminder), and using 
25 for N, I measure 11 seconds. I believe that's good enough (I guess it may be 
good to avoid big transactions).


-- 
Yannick Duch?ne


[sqlite] Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

2016-01-29 Thread Keith Medcalf
> > You could achieve the same result as follows:
> >
> > create table x (x, y, z);
> > create unique index pk_x_x on x(x);
> >
> > except that now you have given the index on x an explicit name and one
> does not have to be manufactured for you.
> >
> 
> I checked these indexes gets created even if I manually define similar
> indexes. But I don't mind. After some experiments, this seems more related
> to ROWID, and you seem to confirm a relation, above.

Well, yes, of course.  If you ask for two indexes to be created, two indexes 
will be created.

create table x (x primary key, y, z);
create unique index pk_x_x on x(x);

requests the creation of two indexes.  One by the "primary key" contained in 
the table definition, and one by the create index.  If you only want one, then 
only say to create one.

https://www.sqlite.org/optoverview.html#autoindex

see especially the last paragraphs of that section, and also

https://www.sqlite.org/fileformat2.html#intschema


Except for without_rowid tables, specifying "primary key" and "unique" within 
the table definition is just syntactic sugar for creating the indexes manually 
using create index; much in the same way ... JOIN ... ON ... is merely 
syntactic sugar for FROM ..., ... WHERE ... (except in the case of outer joins 
and the non-standard CROSS JOIN which cannot be expressed without the sugar).







[sqlite] Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

2016-01-29 Thread R Smith


On 2016/01/28 11:44 PM, Yannick Duch?ne wrote:
> On Thu, 28 Jan 2016 22:08:02 +0200
> R Smith  wrote:
>
> Indeed, I was misunderstanding. Are these indexes visible? Is there 
> any relation with the ones I see from sqlitebrowser? 

If you mean you are looking at the DB schema and seeing automatic 
indices there, then those are mostly the created ones where the table 
doesn't contain a suitable index or perhaps the rowid. If you mean the 
automatic indices that would be created in a select query for which the 
query planner decides to add an index to facilitate faster querying - 
well those can't be seen like that, but you can inspect the output from 
"EXPLAIN QUERY PLAN" or just "EXPLAIN" (added to the front of your 
query). These will show the query planner's steps and scan plans 
respectively (showing any such auto indices it might use).

>
> The shape will change, it's far from final. The tables at that step depends 
> on the next steps in the overall procedure, which is not entirely fixed for 
> now.

Good to know.

>> Could you try these in your Query loop perhaps:
>>
>>
>> -- Table: Renamed field count to cnt because "count" is an SQL reserved
>> word (though it will work, just a better habit)
>> CREATE TABLE element_attribute_values (
>> element TEXT COLLATE NOCASE NOT NULL,
>> attribute TEXT COLLATE NOCASE NOT NULL,
>> value TEXT COLLATE NOCASE NOT NULL CHECK ((attribute <> '') OR (value
>> = '')),
>> cnt INTEGER NOT NULL DEFAULT 0,
>> PRIMARY KEY (element, attribute, value)
>> );
>>
>> -- Insert Loop start:
>> -- This will simply fail if the PK already exists, else start the line
>> with 0 count.
>> INSERT OR IGNORE INTO element_attribute_values VALUES (:element,
>> :attribute, :value, 0);
>>
>> -- This might be faster since it uses only one lookup loop, but it might
>> also not be.
>> -- If you share a list of example data to be inserted, we can find a
>> faster way. Try it and let us know...
>> WITH EAV(id,icnt) AS (
>> SELECT rowid,cnt+1
>>   FROM element_attribute_values
>>  WHERE (element = :element) AND (attribute = :attribute) AND (value =
>> :value)
>>  LIMIT 1
>> )
>> UPDATE element_attribute_values SET cnt = (SELECT icnt FROM EAV)
>>WHERE rowid = (SELECT id FROM EAV)
>> ;
>> -- Insert Loop end.
>>
> The default may even be omitted, and may be the `LIMIT 1` too, as each 
> triplet is unique.

Yeah, the LIMIT 1 is just to force single-row-results which are required 
for the sub-query - it is however not possible to obtain multiple row 
results here, so the limit is superfluous in this case, but I like 
keeping it in for legibility.

> I tried it, with the variant (no `LIMIT 1`) and it's about the same, even a 
> bit more slow (not much, between one and two seconds more long). While speed 
> is not the only concern, or perhaps I should not care that much about the DB 
> file size (the other matters).
>
> It's close to what I had at the beginning, which gave similar timings (just 
> changed to match your recommendation about "cnt"'s name):

Yeah, I did not have a test-bed but I thought it might not improve much. 
Post a table somewhere with typical values (the 103k rows) that would be 
needing insertion, we might find a real fast way.


>
> I'm aware this use case may be a bit pathological, as I could use Python's 
> dictionary. However, I decided to not to, for three reasons:
>
>   * I wanted to see what it's like to use an SQLite DB as an application data 
> container (file or memory)?;
>   * Using a persistent DB is better for incremental process (may stop and 
> resume later)?;
>   * Persistent data is welcome for human review (I think about defining views 
> in sqlitebrowser to dissect the results)?;
>
> For persistence, I first tried CSV files, but this shows to be a inadequate. 
> An SQL DB and a DB browser, looks better than CSV for this use?case.
>

I think the dictionary would be faster for this use-case (mostly cause 
it runs in-memory and we do not require all the DB data guarantees for 
the initial sort). However, the fastest way in SQL would be to use a 
temporary table, do you have any problems with that? It would mean you 
would do the inserts in an insert loop and then simply calculate the 
incidence values and insert to the real table - so it would require a 2 
- step process.

In fact, let me write the SQL quick here. The table create is just like 
before, and then:

BEGIN TRANSACTION;

   -- Make Temporary table
CREATE TEMPORARY TABLE eav (
   e TEXT COLLATE NOCASE,
   a TEXT COLLATE NOCASE,
   v TEXT COLLATE NOCASE
);

   -- The insert loop:
INSERT INTO eav VALUES (:element, :attribute, :value);
   -- end of Insert loop


CREATE INDEX Idx_eav ON eav(e,a,v);

INSERT OR IGNORE INTO element_attribute_values (element,attribute,value,cnt)
   SELECT e,a,v,COUNT(*)
 FROM EAV
WHERE 1
GROUP BY e,a,v;

   -- Cleanup
DROP TABLE eav;

COMMIT;


If that is not a LOT faster then I will be very surprised...



[sqlite] Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

2016-01-28 Thread Yannick Duchêne
On Thu, 28 Jan 2016 22:08:02 +0200
R Smith  wrote:

> I think you are misunderstanding the Pragma and the idea of automatic 
> indices. An automatic Index might be created on a table that doesn't 
> have an adequately assigned primary key. It might also be created during 
> a query (mostly SELECT or sub-SELECT) for which there is no useful Index 
> created by the table designer and the Query planner figures it will be 
> quicker to make an Index than to do table scans through the query. This 
> habit of creating indices during select queries can be forced to not 
> happen by setting the "PRAGMA automatic_index=0;", but this needs to 
> happen when you open the DB connection, or at a minimum, before you try 
> any query - not after the cursor is created,  by that time the index 
> might already be made. (This is why you are not seeing any speed 
> improvement).
> 
> I only mention all the above so you understand what the automatic 
> indexing is about, but it has almost certainly nothing to do with your 
> query slowness, and even if you switch it off at a more opportune time, 
> I would be surprised if it changes the query speed.

Indeed, I was misunderstanding. Are these indexes visible? Is there any 
relation with the ones I see from sqlitebrowser?

> 103k insertions to produce 15k rows... that is ~15% efficiency - the 
> opposite of good design. Perhaps we can help you find better SQL to 
> solve your problem. Let's see...

(red-face)

> 
> I do not know the shape of your data (it matters), but I'm guessing 
> "element" represents standard HTML tags with "attribute" and "value" 
> giving basic expansion of the attributes list. it'd probably be safer to 
> use non case-sensitive values and use standard equation tests in Selects.
> 

The shape will change, it's far from final. The tables at that step depends on 
the next steps in the overall procedure, which is not entirely fixed for now.

> Could you try these in your Query loop perhaps:
> 
> 
> -- Table: Renamed field count to cnt because "count" is an SQL reserved 
> word (though it will work, just a better habit)
> CREATE TABLE element_attribute_values (
>element TEXT COLLATE NOCASE NOT NULL,
>attribute TEXT COLLATE NOCASE NOT NULL,
>value TEXT COLLATE NOCASE NOT NULL CHECK ((attribute <> '') OR (value 
> = '')),
>cnt INTEGER NOT NULL DEFAULT 0,
>PRIMARY KEY (element, attribute, value)
> );
> 
> -- Insert Loop start:
> -- This will simply fail if the PK already exists, else start the line 
> with 0 count.
> INSERT OR IGNORE INTO element_attribute_values VALUES (:element, 
> :attribute, :value, 0);
> 
> -- This might be faster since it uses only one lookup loop, but it might 
> also not be.
> -- If you share a list of example data to be inserted, we can find a 
> faster way. Try it and let us know...
> WITH EAV(id,icnt) AS (
>SELECT rowid,cnt+1
>  FROM element_attribute_values
> WHERE (element = :element) AND (attribute = :attribute) AND (value = 
> :value)
> LIMIT 1
> )
> UPDATE element_attribute_values SET cnt = (SELECT icnt FROM EAV)
>   WHERE rowid = (SELECT id FROM EAV)
> ;
> -- Insert Loop end.
> 

The default may even be omitted, and may be the `LIMIT 1` too, as each triplet 
is unique.

I tried it, with the variant (no `LIMIT 1`) and it's about the same, even a bit 
more slow (not much, between one and two seconds more long). While speed is not 
the only concern, or perhaps I should not care that much about the DB file size 
(the other matters).

It's close to what I had at the beginning, which gave similar timings (just 
changed to match your recommendation about "cnt"'s name):

INSERT OR IGNORE INTO element_attribute_values
  VALUES (:element, :attribute, :value, 0);
UPDATE element_attribute_values 
  SET cnt = cnt + 1
  WHERE (element = :element)
AND (attribute = :attribute)
AND (value = :value);

> -- If you share a list of example data to be inserted, we can find a 
> faster way. Try it and let us know...

I don't mind, I can upload an archive somewhere. I guess you mean table's 
content?

I'm aware this use case may be a bit pathological, as I could use Python's 
dictionary. However, I decided to not to, for three reasons:

 * I wanted to see what it's like to use an SQLite DB as an application data 
container (file or memory)?;
 * Using a persistent DB is better for incremental process (may stop and resume 
later)?;
 * Persistent data is welcome for human review (I think about defining views in 
sqlitebrowser to dissect the results)?;

For persistence, I first tried CSV files, but this shows to be a inadequate. An 
SQL DB and a DB browser, looks better than CSV for this use?case.

-- 
Yannick Duch?ne


[sqlite] Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

2016-01-28 Thread R Smith


On 2016/01/28 10:08 PM, R Smith wrote:
>
>
> -- Insert Loop start:
> -- This will simply fail if the PK already exists, else start the line 
> with 0 count.
> INSERT OR IGNORE INTO element_attribute_values VALUES (:element, 
> :attribute, :value, 0);
>
> -- This might be faster since it uses only one lookup loop, but it 
> might also not be.
> -- If you share a list of example data to be inserted, we can find a 
> faster way. Try it and let us know...
> WITH EAV(id,icnt) AS (
>   SELECT rowid,cnt+1
> FROM element_attribute_values
>WHERE (element = :element) AND (attribute = :attribute) AND (value 
> = :value)
>LIMIT 1
> )
> UPDATE element_attribute_values SET cnt = (SELECT icnt FROM EAV)
>  WHERE rowid = (SELECT id FROM EAV)
> ;
> -- Insert Loop end.
>

Upon re-reading I realised my descriptions were confusing. To be clear, 
this is the entire loop SQL that should form part of your program:

-- Insert Loop start:
INSERT OR IGNORE INTO element_attribute_values VALUES (:element,
:attribute, :value, 0);

WITH EAV(id,icnt) AS (
   SELECT rowid,cnt+1
 FROM element_attribute_values
WHERE (element = :element) AND (attribute = :attribute) AND
(value = :value)
LIMIT 1
)
UPDATE element_attribute_values SET cnt = (SELECT icnt FROM EAV)
  WHERE rowid = (SELECT id FROM EAV)
;
-- Insert Loop end.




[sqlite] Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

2016-01-28 Thread R Smith


On 2016/01/28 8:17 PM, Yannick Duch?ne wrote:
> Hi people,
>
> I though automatic index was controlled by the pragma `automatic_index`, 
> which seems to do nothing for me, unless I'm doing it the wrong way, while 
> `WITHOUT ROWID` on table creations, seems to effectively prevent automatic 
> indexes, the latter with or without the `PRAGMA automatic_index=0`.
>
> Is this expected or is this me doing something wrong?
>
> I'm concerned about it mainly for file size, more than for speed, while it 
> matters too, as the program is far more slow than expected. For a program I'm 
> running I got this:
>
>* With ROWID: execution time is 22 seconds and DB file size is 1.9MB?;
>* Without ROWID, execution time is 19 seconds and DB file size is 788KB 
> (2.4 times smaller)?;
>* With insertion disabled, execution time is 6 seconds, so most of the 
> execution time is spent in rows insertion.
>
> The program indexes HTML files (*), the test runs on only 52 average sized 
> HTML files, so the difference will be even more important with bigger files 
> set.
>
> I'm feeling lost about this: the pragma seems to do nothing and it's `WITHOUT 
> ROWID` on table creations, which seems to do it instead. The pragma is 
> executed right after the cursor creation. I'm using APSW (**). Pragma seems 
> to works, as I tested with `PRAGMA synchronous=0` (just for a test, I really 
> don't want to use this).

I think you are misunderstanding the Pragma and the idea of automatic 
indices. An automatic Index might be created on a table that doesn't 
have an adequately assigned primary key. It might also be created during 
a query (mostly SELECT or sub-SELECT) for which there is no useful Index 
created by the table designer and the Query planner figures it will be 
quicker to make an Index than to do table scans through the query. This 
habit of creating indices during select queries can be forced to not 
happen by setting the "PRAGMA automatic_index=0;", but this needs to 
happen when you open the DB connection, or at a minimum, before you try 
any query - not after the cursor is created,  by that time the index 
might already be made. (This is why you are not seeing any speed 
improvement).

I only mention all the above so you understand what the automatic 
indexing is about, but it has almost certainly nothing to do with your 
query slowness, and even if you switch it off at a more opportune time, 
I would be surprised if it changes the query speed.

>
>
> I don't know if it's OK to ask for this kind of question here. If it's not, 
> just tell me, I won't bother.

It's always Ok to ask here. Welcome to SQLite-World. :)


>
>
> (*) There are bout 103 000 insertions queries ending into 15 600 rows, using 
> this:

103k insertions to produce 15k rows... that is ~15% efficiency - the 
opposite of good design. Perhaps we can help you find better SQL to 
solve your problem. Let's see...

>
>  CREATE TABLE element_attribute_values
>(element TEXT NOT NULL,
> attribute TEXT NOT NULL,
> value TEXT NOT NULL
>   CHECK ((attribute IS NOT '') OR (value IS '')),
> count INTEGER NOT NULL DEFAULT 1,
> PRIMARY KEY (element, attribute, value))
>
>  INSERT OR REPLACE INTO element_attribute_values
>(element, attribute, value, count)
>VALUES (:element, :attribute, :value,
>  (SELECT count + 1 FROM element_attribute_values
>WHERE (element IS :element)
>  AND (attribute IS :attribute)
>  AND (value IS :value)))
>
> (**) Standard Python SQLite3 binding seems perceptibly slow. With APSW, the 
> program is 17 times faster than with standard Python SQLite3 binding?? 
> execution time is around 370 seconds in the latter case, whatever how I 
> manage transactions.
>
>

APSW is far better and maintained by Roger Binns (whom you might see 
post here sometimes). Using that is your best bet, but it also cannot 
help heavy-set queries. The best strategy is to make better queries.

I do not know the shape of your data (it matters), but I'm guessing 
"element" represents standard HTML tags with "attribute" and "value" 
giving basic expansion of the attributes list. it'd probably be safer to 
use non case-sensitive values and use standard equation tests in Selects.

Could you try these in your Query loop perhaps:


-- Table: Renamed field count to cnt because "count" is an SQL reserved 
word (though it will work, just a better habit)
CREATE TABLE element_attribute_values (
   element TEXT COLLATE NOCASE NOT NULL,
   attribute TEXT COLLATE NOCASE NOT NULL,
   value TEXT COLLATE NOCASE NOT NULL CHECK ((attribute <> '') OR (value 
= '')),
   cnt INTEGER NOT NULL DEFAULT 0,
   PRIMARY KEY (element, attribute, value)
);

-- Insert Loop start:
-- This will simply fail if the PK already exists, else start the line 
with 0 count.
INSERT OR IGNORE INTO element_attribute_values VALUES (:element, 
:attribute, :value, 0);

-- This might be faster since i

[sqlite] Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

2016-01-28 Thread Yannick Duchêne
On Thu, 28 Jan 2016 19:59:02 +0100
Dominique Devienne  wrote:

> Not sure to understand what you expect this pragma to do, but inserts
> typically don't involve automatic indexes, which are used only in queries
> (selects).

I though it was responsible for the `sqlite_autoindex_"table_name"_N` which 
gets generated and augments the DB size. That's the name and the documentation 
which makes me believe this. I can see these index in sqlitebrowser and at the 
file size.

> Your insert does have a query, but it fits the PK, so no automatic index is
> needed. (and you can likely use AUTOINCREMENT to avoid that query in the
> first place).

"count" is not a primary key, and AUTOINCREMENT may be used only on primary key.

It indeed does not need a primary key, a unique constraint is enough. Whether 
or not I define a PK or a unique constraint, it ends into the same speed and DB 
size. I just wanted to test if it differs when a PK is defined (it does not).


> The 2.4x size difference is kinda expected, since your PK includes your
> biggest column I suspect (i.e. value), so both the table, and its PK index
> most store it, doubling at least the memory, and with page waste, clearly
> can account for the difference.
> 
> And the time difference can be in-part at least attributed to the increased
> IO.

That was my conclusion too, and that's why I would like to prevent the creation 
of this extraneous index. I can achieve it using `WITHOUT ROWID`, and not with 
the pragma, and that's what I don't understand. So, I can achieve it, just that 
it's in a way I don't understand, which give me the feeling I'm using a trick 
and I don't like it. Unless ROWID is indeed what's responsible for the creation 
of these extraneous indexes??

> It seems unusual to use the HTML text in the PK. Are you sure you need it?

It's short texts (one to ten characters for an average), just element name, 
attribute name and attribute value, unique triplets with a count of their 
occurences. There are only a few attribute values which may be a bit long, the 
href of  elements. I must say I simplified my explanations, it's not really 
the index, it's a prior step which makes some statistics which are later used 
to decide what elements and attributes will be focused on to index the 
contents. The text the final index stores is reasonably short title-like texts.


-- 
Yannick Duch?ne


[sqlite] Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

2016-01-28 Thread Dominique Devienne
On Thu, Jan 28, 2016 at 7:17 PM, Yannick Duch?ne 
wrote:

> I though automatic index was controlled by the pragma `automatic_index`,
> which seems to do nothing for me, unless I'm doing it the wrong way, while
> `WITHOUT ROWID` on table creations, seems to effectively prevent automatic
> indexes, the latter with or without the `PRAGMA automatic_index=0`.
>
> Is this expected or is this me doing something wrong?
>

Not sure to understand what you expect this pragma to do, but inserts
typically don't involve automatic indexes, which are used only in queries
(selects).

And automatic indexes are used only if there isn't an suitable index
already, and SQLite query planner things one will help.

Your insert does have a query, but it fits the PK, so no automatic index is
needed. (and you can likely use AUTOINCREMENT to avoid that query in the
first place).

The 2.4x size difference is kinda expected, since your PK includes your
biggest column I suspect (i.e. value), so both the table, and its PK index
most store it, doubling at least the memory, and with page waste, clearly
can account for the difference.

And the time difference can be in-part at least attributed to the increased
IO.

It seems unusual to use the HTML text in the PK. Are you sure you need it?
--DD


[sqlite] Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

2016-01-28 Thread Keith Medcalf
cr.execute('BEGIN IMMEDIATE')
try:
  rowid = cr.execute('select rowid from element_attribute_values where 
element=? and attribute=? and value=?', (element, attribute, 
value)).fetchone()[0]
  cr.execute('update element_attribute_values set count=count+1 where rowid=?', 
(rowid,))
except:
  cr.execute('insert into element_attribute_values values (?, ?, ?, 1)', 
(element, attribute, value))
cr.execute('COMMIT')


could be replaced by (the below will probably be faster also):

cr.execute('BEGIN IMMEDIATE')
cr.execute('insert or ignore into element_attribute_values values (?, ?, ?, 
0)', (element, attribute, value))
cr.execute('update element_attribute_values set count=count+1 where element=? 
and attribute=? and value=?', (element, attribute, value))
cr.execute('COMMIT')


same caveat that if you want more speed you will have to move your transaction 
out to cover more updates.







[sqlite] Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

2016-01-28 Thread Keith Medcalf

> > Not sure to understand what you expect this pragma to do, but inserts
> > typically don't involve automatic indexes, which are used only in
> > queries
> > (selects).
> 
> I though it was responsible for the `sqlite_autoindex_"table_name"_N`
> which gets generated and augments the DB size. That's the name and the
> documentation which makes me believe this. I can see these index in
> sqlitebrowser and at the file size.

When you create a table thusly:

create table x (x primary key, y, z);

you are creating a rowid table with columns x, y, an z.  You are also saying 
that you want x to be the primary key.  Therefore, you will get a table called 
x containing (ROWID, X, Y, Z).  This is a btree where the index is the rowid.  
In order to implement you PRIMARY KEY, an index must be constructed containing 
the column X and ROWID in the table where the value of X is found.  This index 
will be called sqlite_autoindex_x_1 because it is the first index created 
automatically on table x.

You could achieve the same result as follows:

create table x (x, y, z);
create unique index pk_x_x on x(x);

except that now you have given the index on x an explicit name and one does not 
have to be manufactured for you.

Similarly, if you used:

create table x (x primary key, y unique, z);

you would get a table containing (rowid, x, y, z) and an sqlite_autoindex_x_1 
implementing the unique key on x, and an sqlite_autoindex_x_2 implementing the 
unique constraint (index) on y.  If you wanted to explicitly specify the names 
of the indexes then you could say:

create table x (x, y, z);
create unique index pk_x_x on x(x);
create unique index uk_x_y on x(y);

The end result is the same.  These are indexes that you have DECLARED that you 
want and are entirely different from automatic indexes used to increase the 
performance of queries.

when you create a table thusly

create table x (x primary key, y, z) without_rowid;

you are creating a table without a rowid.  The table x consists entirely of an 
index where the indexed value is x, and the index "result" is y, z.  There is 
no actual table.

The problem with your original query is the use of INSERT or REPLACE.  Do not 
do that.  REPLACE is implemented as a DELETE (which means that all the indexes 
need to be updated and the tree's rebalanced) followed by an INSERT (which 
requires all the indexes to be updated and the trees to be rebalanced).  
Instead you should determine what you want to do and do it.  For example:

cr.execute('BEGIN IMMEDIATE')
try:
  rowid = cr.execute('select rowid from element_attribute_values where 
element=? and attribute=? and value=?', (element, attribute, 
value)).fetchone()[0]
  cr.execute('update element_attribute_values set count=count+1 where rowid=?', 
(rowid,))
except:
  cr.execute('insert into element_attribute_values values (?, ?, ?, 1)', 
(element, attribute, value))
cr.execute('COMMIT')


You can speed it up even more by moving your transactions out to cover multiple 
operations.  
In this way you never need to delete then re-insert the same row.

Remember, the only way to optimize I/O is not to do it.








[sqlite] Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

2016-01-28 Thread Yannick Duchêne
Hi people,

I though automatic index was controlled by the pragma `automatic_index`, which 
seems to do nothing for me, unless I'm doing it the wrong way, while `WITHOUT 
ROWID` on table creations, seems to effectively prevent automatic indexes, the 
latter with or without the `PRAGMA automatic_index=0`.

Is this expected or is this me doing something wrong?

I'm concerned about it mainly for file size, more than for speed, while it 
matters too, as the program is far more slow than expected. For a program I'm 
running I got this:

  * With ROWID: execution time is 22 seconds and DB file size is 1.9MB?;
  * Without ROWID, execution time is 19 seconds and DB file size is 788KB (2.4 
times smaller)?;
  * With insertion disabled, execution time is 6 seconds, so most of the 
execution time is spent in rows insertion.

The program indexes HTML files (*), the test runs on only 52 average sized HTML 
files, so the difference will be even more important with bigger files set.

I'm feeling lost about this: the pragma seems to do nothing and it's `WITHOUT 
ROWID` on table creations, which seems to do it instead. The pragma is executed 
right after the cursor creation. I'm using APSW (**). Pragma seems to works, as 
I tested with `PRAGMA synchronous=0` (just for a test, I really don't want to 
use this).


I don't know if it's OK to ask for this kind of question here. If it's not, 
just tell me, I won't bother.


(*) There are bout 103 000 insertions queries ending into 15 600 rows, using 
this:

CREATE TABLE element_attribute_values
  (element TEXT NOT NULL,
   attribute TEXT NOT NULL,
   value TEXT NOT NULL
 CHECK ((attribute IS NOT '') OR (value IS '')),
   count INTEGER NOT NULL DEFAULT 1,
   PRIMARY KEY (element, attribute, value))

INSERT OR REPLACE INTO element_attribute_values
  (element, attribute, value, count)
  VALUES (:element, :attribute, :value,
(SELECT count + 1 FROM element_attribute_values
  WHERE (element IS :element)
AND (attribute IS :attribute)
AND (value IS :value)))

(**) Standard Python SQLite3 binding seems perceptibly slow. With APSW, the 
program is 17 times faster than with standard Python SQLite3 
binding???execution time is around 370 seconds in the latter case, whatever how 
I manage transactions.


-- 
Yannick Duch?ne