Hi guys, I had planned on writing something about this on Monday. But then I got busy, and now I have family in town visiting for a while, and it's all very annoying that I don't have lots of time for hacking. :)
As I said on IRC, I worked a bit on prototyping some ideas which I'll
explain below.
On Sun, 2005-08-21 at 14:12 +0200, Dirk Meyer wrote:
> I'm no database expert. But when you put everything in one table,
> wouldn't it be slow? I mean, album, artist, length, etc. Maybe the idea
> was that it is faster to do one select to get the basic attributes for
> all audio files and after that do individual selects on the attribute
> table. Maybe I'm wrong. I guess, one file will have about 20
> attributes.
My experience with sqlite has consistently shown that operation time
scales with the number of rows returned (when the WHERE clause deals
with indexed columns). In every case, it is faster to use columns for
metadata attributes than to have a metadata table. For a file type with
10 attributes, the many-columns approach will return 1 row for each
file. The generic metadata table would return 10 rows for each file.
This holds true even when searching multiple tables. More generally,
searching several per-type tables WHERE dir_id=XXX, where each file is a
single row with on average C attribute columns and the queries return N
rows in total, will be faster than searching a single generic attribute
table that returns C*N rows. It should in fact be faster by a factor of
C, or thereabouts.
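To make the two query shapes concrete, here is a rough sketch (cursor and
dir_id are assumed to be in scope; the generic files/attributes tables are
hypothetical and only there for comparison, while files_audio matches the
naming my prototype uses):

# Per-type table: one row per file, attributes as columns.
cursor.execute("SELECT filename, artist, album FROM files_audio "
               "WHERE dir_id=?", (dir_id,))

# Generic attribute table: one row per (file, attribute) pair, so the
# same directory listing returns roughly C times as many rows.
cursor.execute("SELECT file_id, attr, value FROM attributes "
               "WHERE file_id IN (SELECT id FROM files WHERE dir_id=?)",
               (dir_id,))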
My thinking is very much in line with dischi's. Have a table for each
file type. These tables will have columns for important metadata (i.e.
metadata that you'll want to search on later). These tables will also
have a pickle column to store attributes we're not interested in
searching on.
But I extended this approach further by making everything dynamic. The
vfs handles reconstructing the database tables as needed. The
application registers file types and attributes for that filetype with
the vfs. If those attributes are already registered, this is
essentially a no-op. Otherwise, it will rebuild the tables as necessary
to reflect the new attributes.
For my prototype, attributes can be one of the following:
1. ATTR_SIMPLE: an attribute of arbitrary type that cannot be
searched on. This is stored as an item in a pickled dict that is
stored with the row.
2. ATTR_SEARCHABLE: a dedicated column in the database is created
for this attribute. This means you can do queries on this column.
It is not indexed so queries on this column may be slow.
3. ATTR_INDEXED: A separate column as in ATTR_SEARCHABLE, but an
index is added as well, so queries will be much quicker.
4. ATTR_KEYWORDS: This column is indexed for keyword searches.
This will make use of my text search code that I did some months
ago, and the results were very good. Basically, you'd use
ATTR_KEYWORDS for filenames, comment fields, artist or album
fields for music, etc.
I didn't get to implementing ATTR_KEYWORDS yet. But the theory is
good. :) A lot of that code I've already done elsewhere (especially the
parsing and scoring code).
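To give a rough idea of the shape of it, here's a sketch of how indexing
one field might work against the words/words_map tables in the schema
below (index_keywords and the rank argument are made up for illustration;
this isn't code that exists yet, and the real thing would live inside
DBOverlord):

import re

def index_keywords(cursor, file_type, file_id, text, rank=0):
    # Break the field up into lowercased words.
    words = [w for w in re.split("[\s,.:;_/-]+", text.lower()) if w]
    if not words:
        return
    total = float(len(words))
    for word in set(words):
        # Make sure the word exists in the words table and bump its
        # global count.
        cursor.execute("INSERT OR IGNORE INTO words VALUES (NULL, ?, 0)",
                       (word,))
        cursor.execute("UPDATE words SET count=count+1 WHERE word=?",
                       (word,))
        word_id = cursor.execute("SELECT id FROM words WHERE word=?",
                                 (word,)).fetchone()[0]
        # Map the word to the file, with a frequency used for scoring.
        cursor.execute("INSERT INTO words_map VALUES (?, ?, ?, ?, ?)",
                       (rank, word_id, file_type, file_id,
                        words.count(word) / total))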
So the application will do something like this:
db = DBOverlord()
db.register_file_type_attrs("image", (
("width", int, ATTR_SEARCHABLE),
("height", int, ATTR_SEARCHABLE),
("date", int, ATTR_SEARCHABLE),
("comment", str, ATTR_KEYWORDS))
)
This registers the type "image" with the vfs, along with those attributes
(width, height, etc.), each with the given Python type and attribute flag. All of
those attributes will get separate columns in the table. If I had used
ATTR_SIMPLE, they wouldn't have columns but would be stored in the
pickled dict.
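For reference, the registration above would make the prototype build the
image table with roughly this statement (filename, dir_id, mtime and
pickle are standard columns it always adds for file types; ATTR_SIMPLE
attributes never become columns):

create_stmt = ("CREATE TABLE files_image ("
               "id INTEGER PRIMARY KEY AUTOINCREMENT"
               ", filename TEXT, dir_id INTEGER, mtime INTEGER, pickle TEXT"
               ", width INTEGER, height INTEGER, date INTEGER, comment TEXT)")

It also creates an index on dir_id and a unique (dir_id, filename) index,
so lookups by directory stay fast.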
It also supports custom metadata attributes for directories.
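For example (album_art is a made-up attribute, just to illustrate):

db.register_file_type_attrs("dir", (
    ("album_art", str, ATTR_SIMPLE),)
)
db.add_directory("/media/music/Some Album", album_art="cover.jpg")

Since album_art is ATTR_SIMPLE it ends up in the directory's pickle rather
than in a column of the dirs table.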
I've attached my current prototype code. I do think this is the way to
go. It gives you the flexibility of being able to introduce arbitrary
attributes at run-time, with the performance characteristics of the
multi-table, column-per-attribute approach.
As for separate process vs. separate thread, my experience with sqlite
is that locking is cumbersome and I had run into problems. We have to
keep in mind that multiple processes will want to access the vfs (like
the web server, or record server, or whatever). Reading isn't so bad,
but if multiple processes want to write to the vfs, it could cause
problems with locking. We could make the vfs server run as a thread in
the main application so that client calls for Freevo's UI will be as
fast as possible, but for the record server and web server, some IPC
method should be possible. I'm not 100% sure what the best way to do
this is, but we need to keep it in mind.
Anyway, try out the attached code. The class there, DBOverlord (whose
name is obviously kind of a joke :)), just handles database stuff. At
that level, it has no idea about the filesystem. That is, layered on
top of that would be the actual VFS, which does know about the
filesystem and can stat files, walk trees, etc., and plugged into the
VFS would be modules to handle indexing the specific file types.
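To sketch how that layering might look (none of this VFS-side code exists
yet; crawl_directory and the placeholder width/height values are made up,
and it assumes the "image" type was registered as above and the files
aren't already in the database):

import os

def crawl_directory(db, dirname):
    # Make sure the directory itself is in the database.
    dir_id = db.get_directory_id(dirname)
    if dir_id is None:
        db.add_directory(dirname, mtime=int(os.stat(dirname).st_mtime))
        dir_id = db.get_directory_id(dirname)
    for name in os.listdir(dirname):
        path = os.path.join(dirname, name)
        if os.path.isfile(path) and name.lower().endswith(".jpg"):
            # A type-specific indexer module plugged into the VFS would
            # parse the file here; width/height are just placeholders.
            db.add_file("image", dir_id, name, width=0, height=0,
                        mtime=int(os.stat(path).st_mtime))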
I won't be around IRC much in the next while because of my family
visiting, but I'll try to check in once in a while.
Jason.
import string, re, os, time, sys, math, cPickle, stat, types
import kaa.metadata
from pysqlite2 import dbapi2 as sqlite
CREATE_SCHEMA = """
CREATE TABLE meta (
attr TEXT UNIQUE,
value TEXT
);
INSERT INTO meta VALUES('filecount', 0);
INSERT INTO meta VALUES('version', 0.1);
CREATE TABLE types (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT UNIQUE,
attrs_pickle TEXT
);
CREATE TABLE dirs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
dirname TEXT UNIQUE,
mtime INTEGER,
pickle TEXT
);
CREATE INDEX dirs_dirname_idx ON dirs (dirname);
CREATE TABLE words (
id INTEGER PRIMARY KEY AUTOINCREMENT,
word TEXT,
count INTEGER
);
CREATE UNIQUE INDEX words_idx ON words (word);
CREATE TABLE words_map (
rank INTEGER,
word_id INTEGER,
file_type INTEGER,
file_id INTEGER,
frequency FLOAT
);
CREATE INDEX words_map_idx ON words_map (word_id, rank, file_type);
"""
ATTR_SIMPLE = 0x00
ATTR_SEARCHABLE = 0x01 # Is a SQL column, not a pickled field
ATTR_INDEXED = 0x02 # Will have an SQL index
ATTR_KEYWORDS = 0x04 # Also indexed for keyword queries
class DBOverlord:
def __init__(self, dbfile = None):
if not dbfile:
dbfile = "mediadb.sqlite"
self._file_types = {}
self._dir_id_cache = {} # Map path names to database id
self._dbfile = dbfile
self._open_db()
def __del__(self):
self._db.commit()
def _open_db(self):
self._db = sqlite.connect(self._dbfile)
self._cursor = self._db.cursor()
self._cursor.execute("PRAGMA synchronous=OFF")
self._cursor.execute("PRAGMA count_changes=OFF")
self._cursor.execute("PRAGMA cache_size=50000")
if not self.check_table_exists("meta"):
self._db.close()
self._create_db()
self._load_file_types()
def _db_query(self, statement, args = ()):
self._cursor.execute(statement, args)
rows = self._cursor.fetchall()
return rows
def _db_query_row(self, statement, args = ()):
rows = self._db_query(statement, args)
if len(rows) == 0:
return None
return rows[0]
def check_table_exists(self, table):
res = self._db_query_row("SELECT name FROM sqlite_master where " \
"name=? and type='table'", (table,))
        return res is not None
def _create_db(self):
        try:
            os.unlink(self._dbfile)
        except OSError:
            pass
f = os.popen("sqlite3 %s" % self._dbfile, "w")
f.write(CREATE_SCHEMA)
f.close()
self._open_db()
dir_type_attrs = {
"dirname": (str, ATTR_INDEXED),
"mtime": (int, ATTR_SEARCHABLE),
"pickle": (str, ATTR_SEARCHABLE),
}
self._db_query("INSERT INTO types VALUES(NULL, 'dir', ?)",
(cPickle.dumps(dir_type_attrs),))
self._file_types["dir"] = self._cursor.lastrowid, dir_type_attrs
def register_file_type_attrs(self, type_name, attr_list):
if type_name in self._file_types:
# This type already exists. Compare given attributes with
# existing attributes for this type.
cur_type_id, cur_type_attrs = self._file_types[type_name]
new_attrs = {}
db_needs_update = False
for name, type, flags in attr_list:
if name not in cur_type_attrs:
new_attrs[name] = type, flags
if flags:
# New attribute isn't simple, needs to alter table.
db_needs_update = True
if len(new_attrs) == 0:
# All these attributes are already registered; nothing to do.
return
if not db_needs_update:
# Only simple (i.e. pickled only) attributes are added, so we
# don't need to alter the table, just update the types table.
cur_type_attrs.update(new_attrs)
self._db_query("UPDATE types SET attrs_pickle=? WHERE id=?",
(cPickle.dumps(cur_type_attrs), cur_type_id))
return
# Update the attr list to merge both existing and new attributes.
# We need to update the database now.
attr_list = []
for name, (type, flags) in cur_type_attrs.items() + new_attrs.items():
attr_list.append((name, type, flags))
else:
new_attrs = {}
# Merge standard attributes with user attributes for this type.
if type_name == "dir":
attr_list = (
("dirname", str, ATTR_INDEXED),
("mtime", int, ATTR_SEARCHABLE),
("pickle", str, ATTR_SEARCHABLE),
) + tuple(attr_list)
else:
attr_list = (
("filename", str, ATTR_SEARCHABLE),
("dir_id", int, ATTR_INDEXED),
("size", int, ATTR_SIMPLE),
("mtime", int, ATTR_SEARCHABLE),
("pickle", str, ATTR_SEARCHABLE),
) + tuple(attr_list)
if type_name == "dir":
table_name = "dirs"
else:
table_name = "files_%s" % type_name
create_stmt = """
CREATE TABLE %s_tmp (
id INTEGER PRIMARY KEY AUTOINCREMENT
""" % table_name
# Iterate through type attributes and append to SQL create statement.
attrs = {}
for name, type, flags in attr_list:
# If flags is non-zero it means this attribute needs to be a
# column in the table, not a pickled value.
if flags:
sql_types = {str: "TEXT", int: "INTEGER", float: "FLOAT"}
assert(type in sql_types)
create_stmt += ", %s %s" % (name, sql_types[type])
attrs[name] = (type, flags)
create_stmt += ")"
self._db_query(create_stmt)
# Add this type to the types table, including the attributes
# dictionary.
self._db_query("INSERT OR REPLACE INTO types VALUES(NULL, ?, ?)",
(type_name, cPickle.dumps(attrs)))
if new_attrs:
# Migrate rows from old table to new one.
            columns = filter(lambda x: cur_type_attrs[x][1], cur_type_attrs.keys())
            # Carry the row ids over too, so existing references to them
            # (dir_id in the files tables, the dir cache) stay valid.
            columns = string.join(["id"] + columns, ",")
self._db_query("INSERT INTO %s_tmp (%s) SELECT %s FROM %s" % \
(table_name, columns, columns, table_name))
# Delete old table.
self._db_query("DROP TABLE %s" % table_name)
# Rename temporary table.
self._db_query("ALTER TABLE %s_tmp RENAME TO %s" % \
(table_name, table_name))
if type_name != "dir":
self._db_query("CREATE UNIQUE INDEX %s_file_idx on %s (dir_id, "\
"filename)" % (table_name, table_name))
# If any of these attributes need to be indexed, create the index
# for that column. TODO: need to support indexes on multiple
# columns.
for name, type, flags in attr_list:
if flags & ATTR_INDEXED:
self._db_query("CREATE INDEX %s_%s_idx ON %s (%s)" % \
(table_name, name, table_name, name))
self._load_file_types()
def _load_file_types(self):
for id, name, attrs in self._db_query("SELECT * from types"):
self._file_types[name] = id, cPickle.loads(attrs.encode("utf-8"))
def get_directory(self, dirname):
"""
Get the given directory, using the cache if possible. Caller is
responsible to ensure dirname is normalized.
"""
if dirname in self._dir_id_cache:
return self._dir_id_cache[dirname]
row = self._db_query_row("SELECT * FROM dirs WHERE dirname=?", (dirname,))
        if row is None:
            # Don't cache negative results
            return None
        dir = {"id": row[0], "mtime": row[2]}
        if row[3]:
            # Merge in any ATTR_SIMPLE attributes stored in the pickle.
            dir.update(cPickle.loads(row[3].encode("utf-8")))
self._dir_id_cache[dirname] = dir
return dir
def get_directory_id(self, dirname):
dir = self.get_directory(dirname)
if dir:
return dir["id"]
def _make_query_from_attrs(self, query_type, attrs, type_name):
type_attrs = self._file_types[type_name][1]
columns = []
values = []
placeholders = []
        # Drop attributes whose value is None; iterate over a copy of the
        # keys since we delete from the dict as we go.
        for key in attrs.keys():
            if attrs[key] is None:
                del attrs[key]
attrs_copy = attrs.copy()
for name, (type, flags) in type_attrs.items():
if flags != ATTR_SIMPLE:
columns.append(name)
placeholders.append("?")
if name in attrs:
values.append(attrs[name])
del attrs_copy[name]
else:
values.append('')
if len(attrs_copy) > 0:
values[columns.index("pickle")] = cPickle.dumps(attrs_copy)
else:
values[columns.index("pickle")] = None
if type_name == "dir":
table_name = "dirs"
else:
table_name = "files_" + type_name
if query_type == "add":
columns = string.join(columns, ",")
placeholders = string.join(placeholders, ",")
q = "INSERT INTO %s (%s) VALUES(%s)" % (table_name, columns, placeholders)
else:
q = "UPDATE %s SET " % table_name
for col, ph in zip(columns, placeholders):
q += "%s=%s," % (col, ph)
# Trim off last comma
q = q[:-1]
if type_name == "dir":
q += " WHERE dirname=?"
values.append(values[columns.index("dirname")])
else:
q += " WHERE dir_id=? AND filename=?"
values.append(values[columns.index("dir_id")])
values.append(values[columns.index("filename")])
# TODO: keyword indexing for ATTR_KEYWORDS attributes.
return q, values
def _set_directory(self, action, dirname, **attrs):
attrs.update({"dirname": dirname})
query, values = self._make_query_from_attrs(action, attrs, "dir")
self._db_query(query, values)
if action == "add":
dir = { "id": self._cursor.lastrowid }
dir.update(attrs)
self._dir_id_cache[dirname] = dir
elif dirname in self._dir_id_cache:
self._dir_id_cache[dirname].update(attrs)
def add_directory(self, dirname, **attrs):
"""
Add a directory; caller's responsibility to ensure dirname is
normalized. dirname must not exist in the database.
"""
self._set_directory("add", dirname, **attrs)
def update_directory(self, dirname, **attrs):
row = self._db_query_row("SELECT pickle FROM dirs WHERE dirname=?",
(dirname,))
assert(row)
        if row[0]:
            # Merge existing pickled attrs; new values take precedence.
            saved_attrs = cPickle.loads(row[0].encode("utf-8"))
            saved_attrs.update(attrs)
            attrs = saved_attrs
self._set_directory("update", dirname, **attrs)
def _set_file(self, action, file_type, dir_id, filename, **attrs):
assert(file_type in self._file_types)
attrs.update({
"dir_id": dir_id,
"filename": filename,
})
query, values = self._make_query_from_attrs(action, attrs, file_type)
self._db_query(query, values)
return attrs
def add_file(self, file_type, dir_id, filename, **attrs):
"""
Add a file to the database; file_type is the file type as given to
register_file_type_attrs(); dir_id is the directory id as gotten with
get_directory_id() or add_directory(); file_name is the file name
including extension. attrs kwargs will vary based on file_type.
"""
return self._set_file("add", file_type, dir_id, filename, **attrs)
def update_file(self, file_type, dir_id, filename, **attrs):
row = self._db_query_row("SELECT pickle FROM files_%s WHERE dir_id=? AND filename=?" % file_type,
(dir_id, filename))
assert(row)
        if row[0]:
            # Merge existing pickled attrs; new values take precedence.
            saved_attrs = cPickle.loads(row[0].encode("utf-8"))
            saved_attrs.update(attrs)
            attrs = saved_attrs
return self._set_file("update", file_type, dir_id, filename, **attrs)
def dir(self, dir_id):
# XXX: this isn't how you'd want to do this, just testing ...
results = []
for type_name, (type_id, type_attrs) in self._file_types.items():
if type_name == "dir":
continue
columns = filter(lambda x: type_attrs[x][1], type_attrs.keys())
q = "SELECT %s FROM files_%s WHERE dir_id=?" % \
(string.join(columns, ","), type_name)
rows = self._db_query(q, (dir_id,))
columns_dict = {}
for col, n in zip(columns, range(len(columns))):
columns_dict[col] = n
for row in rows:
#if row[columns_dict["pickle"]]:
# row = list(row)
results.append((columns_dict, type_name, list(row)))
return results
db = DBOverlord()
db.register_file_type_attrs("image", (
("width", int, ATTR_SEARCHABLE),
("height", int, ATTR_SEARCHABLE),
("date", int, ATTR_SEARCHABLE),
("comment", str, ATTR_KEYWORDS))
)
db.register_file_type_attrs("audio", (
("title", str, ATTR_KEYWORDS),
("artist", str, ATTR_KEYWORDS | ATTR_INDEXED),
("album", str, ATTR_KEYWORDS),
("genre", str, ATTR_INDEXED),
("samplerate", int, ATTR_SIMPLE),
("length", int, ATTR_SIMPLE),
("bitrate", int, ATTR_SIMPLE),
("trackno", int, ATTR_SIMPLE))
)
db.register_file_type_attrs("audio", (
("trackno2", int, ATTR_SEARCHABLE),)
)
db.register_file_type_attrs("dir", (
("trackno2", int, ATTR_SEARCHABLE),)
)
"""
db.add_directory("/home/tack")
db.update_directory("/home/tack", mtime=344, trackno2 = 42)
id = db.get_directory_id("/home/tack")
assert(id != None)
for i in xrange(5000):
db.add_file("image", id, "foobar%s.jpg" % i, width=100, height=100, frobate="asdf")
"""
t0=time.time()
files = db.dir(1)
print len(files)
print time.time()-t0
t0=time.time()
for d, tp, row in files:
a=row[d["filename"]]
print "IteR", time.time()-t0