On Dec 16, 2017 4:46 PM, "Mark Phillips" <m...@phillipsmarketing.biz> wrote:

I am building an electronic document management system using django and
mysql. The purpose of this system is to catalog and retrieve electronic
documents. There is no inventory of the physical documents. There are about
3,000 documents (photos, pdfs, videos) now, and that number could double in
the next year. The purpose of the site is to search for a set of documents
that share some given set of metadata values. I am wrestling with the model
design for the metadata part of this project.

I have identified about 30 internal metadata types particular to these
documents, and not related to any standards. If I use Dublin Core, there
are 71 types, but I would not need all of them, and there is some overlap
with the 30 internal metadata types - say another 25 types. Schema.org also
has an interesting set of metadata tags, some of which may be applicable to
this project; using that standard as well could bring the number of types
to 75.


While I'm not familiar with these collections of metadata values that you
mentioned, if there is any chance that you'll need such standardization for
interoperability with external systems and/or standardized reporting, I'd
recommend going with the one that best fits your data. Adding extra
columns on top of that would give you the flexibility to cover other
attributes that don't fit a particular standard.


As I see it, I have three options.

Option 1 - traditional database design, with metadata types as columns, and
values in rows. The document model would have approx. 50-80 fields, and
across the 3,000 documents one could expect about 50-70% of the values
to be NULL.

This option is the most straightforward to create. I have no idea what the
performance would be for django and mysql, searching for a set of documents
in a model with 50-80 fields and 3,000 entries, where 60% are Null. Is this
a feasible design, or do I need to think of another way to model this
metadata? Will this scale to 6,000 documents? What if I need to add another
metadata standard to the site in the future? (I am not looking for job
security in this project)


With the numbers you are talking about here, you'll see zero performance
hit. Circle back around when you get close to hundreds of columns and/or
millions of rows.

Just be sure that you have heavy indexing enabled on all columns that will
be commonly used in search queries. This sounds like a high-volume read and
low-volume write database, which is the best kind. At the size you're
talking about, you can probably have the entire thing cached in memory and
make it super fast.
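
For what it's worth, here's a rough sketch of what that wide-table option
could look like as a Django model, with db_index=True on the columns you
expect to search most often. The field names are just placeholders, not
your actual metadata types:

    # Sketch of Option 1: one wide Document model with nullable metadata
    # columns, indexed where they'll be searched often. Field names are
    # hypothetical placeholders.
    from django.db import models

    class Document(models.Model):
        file = models.FileField(upload_to="documents/")
        # Commonly searched metadata gets db_index=True so MySQL can use
        # an index instead of scanning the whole table.
        title = models.CharField(max_length=255, blank=True, db_index=True)
        author = models.CharField(max_length=255, blank=True, db_index=True)
        date_created = models.DateField(null=True, blank=True, db_index=True)
        # ... the remaining ~50-80 metadata columns, mostly nullable ...
        dc_subject = models.CharField(max_length=255, blank=True)

    # Searching for documents that share a set of metadata values is then
    # a plain filter:
    # Document.objects.filter(author="Smith", date_created__year=2017)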



Option 2 - some sort of EAV design, with a table that has document_id,
metadatatype_id, and metadatavalue_id. I have read all sorts of posts about
the evils of EAV, but they seem to have peaked about 8 years ago, and now I
am seeing some posts that even recommend this design approach.

Is this approach a better fit for my particular application? Will it scale
to the numbers of documents and metadata I am thinking about? I really
don't want to go to the dark side if I can help it.


I have no experience with EAV design, but from a quick read, your scale
concerns run in the opposite direction from what you think: I don't believe
you have enough data to justify the complexity of EAV and see any benefit
from it. EAV is designed for highly optimized Big Data databases that have
millions or billions of rows.
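
Just so we're looking at the same thing, an EAV layout in Django terms would
be roughly this (model and field names are illustrative only):

    # Sketch of Option 2 (EAV): one row per (document, metadata type, value).
    from django.db import models

    class MetadataType(models.Model):
        name = models.CharField(max_length=100, unique=True)  # e.g. "dc:creator"

    class Document(models.Model):
        file = models.FileField(upload_to="documents/")

    class MetadataValue(models.Model):
        document = models.ForeignKey(Document, on_delete=models.CASCADE,
                                     related_name="metadata")
        metadata_type = models.ForeignKey(MetadataType, on_delete=models.PROTECT)
        value = models.CharField(max_length=255, db_index=True)

    # Matching several metadata criteria means chaining filters (one join
    # per criterion), which is where the EAV complexity shows up:
    # Document.objects.filter(metadata__metadata_type__name="dc:creator",
    #                         metadata__value="Smith") \
    #                 .filter(metadata__metadata_type__name="dc:date",
    #                         metadata__value="2017")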


Option 3 - A hybrid solution that I have also read about. Put the more
common metadata types as fields in the model (columns in the database
table), and then JSON-ize the rest of the metadata for that document in a
string field in the model. This seems to put a rather large burden on the
application code when searching for a set of documents with a particular
set of metadata values.

The performance cost of this method seems to depend heavily on the
distribution of metadata types in the entire document set, and being smart
enough to pick the best common ones. If there were a set of metadata types
that span 90% of the documents, then this would probably be a good
solution. If they span only 25% of the documents, there is a lot of
comparing of values inside the JSON string.


This one has potential, but it would need to be a judgement call on your
part. If you are hitting the JSON attributes for more than 25% (or whatever
threshold you choose) of your standard searches, then you'll need to add
more columns.
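
A minimal sketch of that hybrid, assuming the JSON part lives in a plain
text field (so it works on MySQL without native JSON support) and the
leftover filtering happens in Python, which is exactly the application-side
burden you describe:

    # Sketch of Option 3: common metadata as indexed columns, the long tail
    # serialized as JSON in a text field. Names are placeholders.
    import json

    from django.db import models

    class Document(models.Model):
        file = models.FileField(upload_to="documents/")
        # Metadata types that span most documents become real columns.
        title = models.CharField(max_length=255, blank=True, db_index=True)
        author = models.CharField(max_length=255, blank=True, db_index=True)
        # Everything else, e.g. {"dc:rights": "CC-BY", "schema:genre": "photo"}
        extra_metadata = models.TextField(blank=True, default="{}")

    def filter_by_extra(queryset, key, value):
        """Narrow a queryset by a key/value stored in the JSON blob.

        The database can only narrow by the real columns; Python does
        the rest.
        """
        return [doc for doc in queryset
                if json.loads(doc.extra_metadata or "{}").get(key) == value]

    # candidates = Document.objects.filter(author="Smith")
    # matches = filter_by_extra(candidates, "dc:rights", "CC-BY")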


Thanks for any insights/recommendations you can provide!


Have you thought of using a document store database like MongoDB? It would
appear your use case is perfect for it. Of course you wouldn't have the
advantage of the Django ORM, but I wouldn't let that keep you from using
the right tool. There should be swaths of sites using it and dealing with
your exact use case. To me, trying to crowbar this into an RDBMS (even
Postgres with HSTORE support) seems rigid, and would force you to customize
every single piece of the search process. Other indexing tools like
Elasticsearch or Haystack may also help. When dealing with the actual
document data, don't let Django be the limiting factor. You can still use
MySQL/Postgres for the non-IP data like user accounts and session
management.
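
As a rough illustration with pymongo (assuming a local mongod and a made-up
"documents" collection), each document simply carries whatever metadata keys
it actually has, and you index the keys you search most:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    docs = client["edms"]["documents"]

    # Each document stores only the metadata it actually has.
    docs.insert_one({
        "path": "photos/img_0001.jpg",
        "dc:creator": "Smith",
        "dc:date": "2017-12-16",
        "schema:keywords": ["family", "holiday"],
    })

    # Index the commonly searched keys, then query on any combination.
    docs.create_index("dc:creator")
    results = docs.find({"dc:creator": "Smith", "dc:date": "2017-12-16"})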

-James
