[sqlite] I don't understand how to use NOT EXISTS

R.Smith Thu, 17 Sep 2015 21:23:32 +0200


On 2015-09-17 05:13 AM, Nicolas J?ger wrote:
> @Ryan,
>> To get you started on this road, may I ask you take a leap of faith
>> and explain to us from A to Z what you would like to achieve with the
>> tables and data you are asking about above - i.e don't show us the
>> programming you think will solve the problem and ask us if that is
>> good SQL, rather explain us the problem to be solved and allow us to
>> show you the good SQL that will do it.
>>
>> Every next task should be clearer and easier.
> Actualy, it's a personal open source project. So the code could be see
> by anyone. The only thing is, the most recent part isn't yet on the
> repo (last push was before the summer). But, if for some reason you
> want to see the code as it is today on my computer (almost all files
> were heavily modified during the summer), I can push it.
>
> this project is about helping me to manage all files I have on my
> computer. I'm a metalhead and science nerd at the same time.
> So, I have tons of music and tons of ebooks and publications... So my
> project as currently two goals:
>
>    1) store the data on disk, I should not access the files directly.
>
>    2) using a DB to retrieve any document trough a search engine.
>
> (actually I want to add a third goal: I want to create something I
> don't know how I will call it, but it's something like a binder. For
> example if you open a binder about 'modelling demons', you can link data
> of differents files even if they don't share tags or whatever (can be
> in that example some texture, some .blend or pdf about modelling
> humanoid...).)
>
> So, I'm writing some custom GTKmm widgets wich interact with the DB
> (but they don't directly access any sqlite function, I'm working with
> engines communication (gtkmm_engine, sqlite_engine, etc.)).
>
> I'm currently thinking to make that program modular. For example, the
> user (usualy me), set what kind of data (trough some config file (in
> my head I call these files 'collection')) should be store in the
> DB. For each data will correspond a widget.
>
> Let's took an example,
>
> for ebooks:
>
> I want to save the filename (filename is ruled by the program, the
> user has no direct way to modify it), title, some tags and a description
> for each ebook (for the last three the program shows some widgets,
> wich allow the user to modify the values).
>
> At this point, when the program read the ebook.collection, it will
> create a first table EBOOK with four columns
> (ID,FILENAME,TITLE,DESCRIPTION)
>
> for the tags, things are more complicated. I need two tables:
> - TAGS with three columns (ID,NAME,COUNT)
> - TAGSMAP with two columns (ID_EBOOK,ID_TAGS) (I explain the use below)
>
> so if I take my old example :
>
> (id|NAME|COUNT)
> 53|black metal|3
> 55|occult rock|3
> 61|satanic smurfs|1
> 62|beer|0
> 63|pizza|0
>
> if I'm seeking to get all files with the tags 'black metal' the program
> check all entries in TAGSMAP with ID_TAGS = 53,
>
> SELECT * FROM TAGSMAP WHERE ID_TAGS = 53 would returns something like,
>
> (ID_EBOOK|ID_TAGS)
> 3|53
> 5|53
> 9|53
>
> then the ID_EBOOKs refer to the IDs of the data (title and description
> in this case) for some files in EBOOK.
>
> now my problems come when you want to add/delete a tags. If for a file
> you want (eg) to delete the tag 'black metal' you click on the button I
> setup to delete that tag, the program at this time has two
> informations:
>
> the current EBOOK.FILENAME and the TAGS.NAME of the tag.
>
> So I have to query the DB to get their respective IDs (ID of that
> FILENAME in EBOOK and ID of that NAME in TAGS).
>
> Then, the program has to delete the entry in the TAGSMAP where ID_EBOOK
> is the same than ID for the current file in EBOOK and where ID_TAGS
> equal the ID in TAGS.
>
> For TAGS, the program has first to check COUNT. Before the check I have
> to get the value from the DB.
>
>    If COUNT = 1, the program has to delete the tag with NAME 'black
>    metal'.
>
>    If COUNT > 1, the program has to decrement count.
>
> My first manner to do that, was to just store/read the data to/from the
> DB. All conditions (like what I try to do for some days now...) were
> executed by the program, not by queries. But the problem (wich is
> not really a big issue), I do a lot of queries because of the IFs in my
> program then I get one/two seconds of lags (wich may also be here
> because I wrote all operation I do in the console for debug
> purpose...)...
>
> So, as I wrote in an earlier email, since I know I can use
> transaction, I try to replace all queries in one giant query send to
> the DB when I want to update the data for a file (title, tags and
> description).
>
> Maybe, I should ask if what I want to do is really possible by
> SQL or if I should better let the IFs in the program better than
> struggling like I did in my last emails...
>
>
> I really hope I was clear. If you need further explanation about
> something just tell it. Anyway thank you for the time you spend to
> explain your magic to some "muggle" like me...


Ok, firstly, these things you wish to do is exactly the most common use 
of Database engines. Is it suitable to do in a database?, heck these are 
the exact things that databases exist for. You will also be happy to 
hear there already exists a myriad of programs doing exactly what you 
are designing here, perhaps with a slightly different paradigm or 
use-case, but in terms of data-usage, much the same.

Some initial things.
Data needs to be kept in 1NF (or 1st-Normal-Form) which is relational 
theory speak for "Everything without a 1-to-1 relationship with the key 
field in a table, should be in another table".  Many reasons for this 
(if you care to read up on some RT) but the most simple reason is: We 
can make complex extractions from 1NF data tables via JOINs, sub-queries 
and the like, but we cannot extract simple information (or de-complicate 
it) from convoluted tables.

Example: You can store a person's ID, name, surname, hair colour, age 
all in 1 table, but you can't put the names of his/her children in 
there, because no longer a 1-to-1 relationship. In your case, all files 
can go in 1 table with their respective types, genres, usage, path, etc. 
but not the tags. Tags will need to be in another table.

This also has a reverse-implication: Do not put things which are in 
effect similar, and might differ only by simple properties with 1-to-1 
relations, in different tables. Put them together.

Example: In a "People" database, you can list everyone's occupation. 
There is no need to make a table for only scientists, and another for 
only Entertainers. Just use 1 table and make sure it has a property 
(field) for "Occupation".  In your use-case, there is no need to have 
different tables for E-Books or Videos or Music... they are all just 
files with slightly different properties. Sure, an e-book might have an 
edition while a music file would typically not, and a music file might 
have both an artist and a composer (and perhaps an author) while an 
e-book only have an author, but these are minute differences and will 
manifest in very few fields. There is no harm in having fields for all 
of Author, Composer and Artist, even if some entries do not use all of it.

Then again, you might wish to record a lot more information about say 
the Author of anything, like gender, date of birth, country, etc. - in 
which case Author information will definitely move to another table, but 
you will still have an "Author" field in the main table to know how to 
link to the other table.

Making combinations later (your idea about a sort of binder) will be 
especially easy when all relevant files are maintained in a single table 
with foreign keys to other tables as needed.

Most combinations can be achieved by a simple table with double-keys in 
it. For instance the TAGS table as you've described it will be a simple 
table with 2 fields (columns) containing the FILE-table key in one, and 
the TAG key in the other, with the primary key for this TAGS table made 
over both columns. That way it is impossible to duplicate things, and 
none of all that delete tags stuff is needed that you noted in your design.
This will also make everything easy w.r.t. counting how many tags of 
which kinds exist and so on. And if you need to know how many tags of a 
certain kind exists but only for e-books, then that is achieved with a 
simple join lookup.

This brings me to a good next point: When programming, we always try to 
make the data as meaningful as possible, such as storing reference 
counts etc. But in a database, we try to make the data as plain as 
possible - i.e. you should never store data values which can be attained 
from a query inside the database. (Unless the best possible method of 
obtaining it is still too slow, but that is an exercise for much later 
and will require optimizations which should never be entertained 
prematurely).

I can envision 3 initial tables you will need:
FILES
- to keep all file-references with their id, file_Name, file_Path, 
file_extension,  kind (e-Book, Music, etc.), size, opens_with, 
run_length, author, composer, artist, etc.
TAGNAMES
- to keep a simple list of Tag IDs and their Names/Captions/Descriptions
TAGS
- a simple reference table with 2 columns (file_id, tag_id) to keep 
track of which tags are added to which files.

(More may be needed later)
Note: Things like "Genre" can be envisioned as a Tag rather - this way 
you can add multiple genres (or none) to a single file.

A first-presentation schema might look something like this:

CREATE TABLE "Files" (
   id INTEGER PRIMARY KEY,
   file_kind TEXT NOT NULL COLLATE NOCASE,
   size INT NOT NULL,
   run_length INT,
   rating INT,
   file_name TEXT NOT NULL COLLATE NOCASE,
   file_extension TEXT COLLATE NOCASE,
   file_path TEXT NOT NULL COLLATE NOCASE,
   artist TEXT,
   author TEXT,
   composer TEXT,
   synopsis TEXT,
   CONSTRAINT UFile_Key UNIQUE (file_name, file_extension, file_path)
);

CREATE TABLE "TagNames" (
   name TEXT NOT NULL COLLATE NOCASE PRIMARY KEY,
   kind TEXT NOT NULL COLLATE NOCASE DEFAULT '',
   description TEXT COLLATE NOCASE
) WITHOUT ROWID;

CREATE TABLE "Tags" (
   file_id INTEGER,
   tag_name TEXT NOT NULL COLLATE NOCASE,
   PRIMARY KEY (file_id, tag_name),
   FOREIGN KEY (file_id)  REFERENCES "Files"(id) ON DELETE CASCADE ON 
UPDATE CASCADE,
   FOREIGN KEY (tag_name)  REFERENCES "TagNames"(name) ON DELETE CASCADE 
ON UPDATE CASCADE
) WITHOUT ROWID;


Note: the COLLATE NOCASE options added specifically to the file_* fields 
above assumes NTFS or such file system that is NOT case-sensitive. If 
this is to run on Linux or indeed anything with a case-sensitive 
file-system, then best is to keep the file name, path and extension also 
case-sensitive by removing those NOCASE options. Other NOCASE fields 
should remain as is.

Some Queries for basic tasks
----------------------------------------
1 - Add a normal Tag-name called "SciFi":
REPLACE INTO "TagNames" (name) VALUES ('SciFi');

2 - Add a genre-specific Tag-name called "SciFi" with some description 
(this will overwrite the previous entry of Sci-Fi with more detail):
REPLACE INTO "TagNames" (kind, name, description) VALUES ('genre', 
'SciFi', 'Science-Fiction');

3 - Add a File for an e-book called "The Martian"
REPLACE INTO "Files" (id, file_kind, size, file_name, file_extension, 
file_path, author, synopsis) VALUES
(1, 'ebook', 10424, 'the_martian_book.txt', 'txt', 'd:\documents\books', 
'Andy Weir',
  'Six days ago, astronaut Mark Watney became one of the first people to 
walk on Mars. Now, he's sure he'll be the first person to die there.');

4 - Assign the genre-specific tag we already added to the tag-names list 
to this book:
INSERT INTO "Tags" VALUES (1, 'SciFi');
Note: This will fail if either the book ID 1 or the tag name 'SciFi' 
does not exist yet.
Note2: It will also fail if tried a second time, since the combination 
of file id and tag name must be unique in the table.
Note3: There are different way of attaining an inserted id, if you 
prefer not to control the id yourself but relinquish it to the AUTOINC 
mechanism.

5 - Change the Tag-Name 'SciFi' to 'Sci-Fi':
UPDATE "TagNames" SET name = 'Sci-Fi' WHERE name = 'SciFi';
Note: The above will change all SciFi tags in the "Tags" table to 
'Sci-Fi' so that the relationship is not broken - all thanks to the ON 
UPDATE CASCADE in the foreign key.
Note2: This requires Foreign-Key checking is ON for your database 
connection (It is off by default, run SQL "PRAGMA foreign_keys = ON;" to 
enable.
Read more at:
https://www.sqlite.org/foreignkeys.html

6 - Delete the Tag-name "Sci-Fi"
DELETE FROM "TagNames" WHERE name = 'Sci-Fi';
Again, this will remove all such tags throughout the "Tags" table too 
because of the foreign-key relationship, but you are equally welcome to run:
DELETE FROM "Tags" WHERE tag_name = 'Sci-Fi';

7 - Find all e-books and their genres with txt extensions that resides 
somewhere on the D drive:
SELECT F.file_path, F.file_name, max(F.size) AS Size, 
group_concat(G.name) AS Genres
FROM Files
   LEFT JOIN Tags AS T ON T.file_id = F.id
   LEFT JOIN TagNames AS G ON G.name = F.tag_name AND G.kind = 'genre'
WHERE F.kind = 'ebook' AND F.file_path LIKE 'D:%' AND F.file_extension = 
'txt'
GROUP BY F.file_path, F.file_name

Note: The max(F.size) is just to satisfy SQL grouping semantics, it is 
imposible in the schema to have more than one file with same path and 
same name, so size will always be the same for all items in such a group.


Ok, enough for day 1. I've been as brief as possible, but lots of these 
conventions may need some elaboration (not enough space here to add all 
of it), but please do ask where needed. Obviously your schema will be 
more complete (I imagine adding such things as release_date, IMDB_link, 
perhaps youtube link etc.), but once you have a good schema, we can 
elaborate more on the specifics. Another table might keep a list of 
reviews or references found online - The World is your oyster....

btw: Your binder idea is equally easy implementable with a similar set 
of tables as the Tags above. One table to record a Binder name, purpose 
and such, and another foreign-keyed dual-reference list connecting every 
included file to its binder.

I'm hoping the idea is shining through on how this is different from 
normal programming but much much more powerful for handling data.


Keep us posted :)
Ryan

[sqlite] I don't understand how to use NOT EXISTS

Reply via email to