On Fri, 27 Mar 2009 15:53:18 +0100, Jonas Sandman
<[email protected]> wrote:
>Hello,
>
>I have a database with about 137000 * 2 rows with four columns;
>fileid, filename, filepath and istarget.
>It's used to determine if two scanned directories are equal so I run a
>simply query to get the rows that are missing on the target directory
>but do exists in the source directory.
>
>I use this query:
>
>SELECT f.filepath, f.filename FROM files f
>WHERE f.istarget=0
>AND NOT EXISTS (SELECT * FROM files WHERE filepath=f.filepath AND
>filename=f.filename AND istarget=1)
In the sub-SELECT you use * so you get all columns. That's
not necessary.
SELECT f.filepath, f.filename
FROM files f
WHERE f.istarget=0
AND NOT EXISTS (
SELECT ROWID FROM files
WHERE filepath=f.filepath
AND filename=f.filename
AND istarget=1
);
It might be better to rewrite as a self-join using LEFT
OUTER JOIN on filename and filepath, using an NULL istarget
from one or the other alias as an indication that target
instance of the file is missing.
Something like:
SELECT source.filepath, source.filename
FROM files AS source
LEFT OUTER JOIN files AS target
USING (filepath,filename)
WHERE source.istarget = 0
AND target.istarget IS NULL
ORDER BY whatever you like;
(untested)
>and I have experimented with some index to improve the speed, both
>index (filename, filepath, istarget), (filename), (filepath) etc...
The SQLite optimizer can only use one index at a time.
In general, the index should be as selective as possible.
This is known as cardinality.
In your case, (filename, filepath) should do,
unless all files in all directories have the same series of
names, in which (filepath, filename) could be better.
Adding istarget doesn't hurt, it would make the index usable
as a primary key.
If you defined more than one index, you can help the
optimizer to choose the best index by running ANALYZE; on a
database filled with representative data.
>I am still not quite satisfied with the speed (a few seconds to check this).
>Perhaps the table schema itself is the problem?
You could normalise some more, by creating a second table
CREATE TABLE Pathnames (
pathid INTEGER PRIMARY KEY,
pathname UNIQUE
);
and referring to its with a foreign key in the file table.
CREATE TABLE Files (
pathid INTEGER
CONSTRAINT fk_path REFERENCES Pathnames (pathid)
ON INSERT RESTRICT
ON UPDATE RESTRICT,
filename TEXT,
istarget INTEGER,
PRIMARY KEY (filepathid,filename,istarget)
);
This reduces the overall database size, improves the amount
of unique data in a database page, and above that comparing
integers is faster than strings.
I can't help noticing that comparing two directories in the
same system would cause the source and target paths to be
different by definition (assuming the filepath column
represents the absolute path), but I guess you are aware of
that :)
>Can someone help me?
>
>Regards,
>Jonas
--
( Kees Nuyt
)
c[_]
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users