I am building an electronic document management system using django and
mysql. The purpose of this system is to catalog and retrieve electronic
documents. There is no inventory of the physical documents. There are about
3,000 documents (photos, pdfs, videos) now, and that number could double in
the next year. The purpose of the site is to search for a set of documents
that share some given set of metadata values. I am wrestling with the model
design for the metadata part of this project.

I have identified about 30 internal metadata types particular to these
documents, and not related to any standards. If I use Dublin Core, there
are 71 types, but I would not need all of them, and there is some overlap
with the 30 internal metadata types - say another 25 types. Schema.org also
has an interesting set of metadata tags, and some may be applicable to this
project, and using that standard as well, could bring the number of types
to 75.

As I see it, I have three options.

Option 1 - traditional database design, with metadata types as columns, and
values in rows. The document model would have approx 50 - 80 fields, and
one could expect out of 3,000 documents about 50-70% of the values are
Null.

This option is the most straight forward to create. I have no idea what the
performance would be for django and mysql, searching for a set of documents
in a model with 50-80 fields and 3,000 entries, where 60% are Null. Is this
a feasible design, or do I need to think of another way to model this
metadata? Will this scale to 6,000 documents? What if I need to add another
metadata standard to the site in the future? (I am not looking for job
security in this project)

Option 2 - some sort of EAV design, with a table that has document_id,
metadatatype_id, and metadatavalue_id. I have read all sorts of posts about
the evils of EAV, but they seem to have peaked about 8 years ago, and now I
am seeing some posts that even recommend this design approach.

Is this approach a better fit for my particular application? Will it scale
to the numbers of documents and metadata I am thinking about? I really
don't want to go to the dark side if I can help it.

Option 3 - A hybrid solution that I have also read about. Put the more
common metadata types as fields in the model (columns in the database
table), and then JSON-ize the rest of the metadata for that document in a
string field in the model. This seems to put a rather large burden on the
application code when searching for a set of documents with a particular
set of metadata values.

The performance cost of this method seems to depend heavily on the
distribution of metadata types in the entire document set, and being smart
enough to pick the best common ones. If there were a set of metadata types
that span 90% of the the documents, then this would probably be a good
solution. Span only 25% of the documents, and there is a lot of comparing
values in the JSON string.

Thanks for any insights/recommendations you can provide!

Mark

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-users+unsubscr...@googlegroups.com.
To post to this group, send email to django-users@googlegroups.com.
Visit this group at https://groups.google.com/group/django-users.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-users/CAEqej2MYmfqaA1Gq7nByjGo%3DcxF6_x7fbz4cuRy2EgrCBmuDVA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to