[google-appengine] Re: Datastore quota usage far exceeds sum size of items in datastore

Nick Johnson (Google) Mon, 27 Jul 2009 10:52:32 -0700

On Mon, Jul 27, 2009 at 6:31 PM, Calvin Slayden <calvin.slay...@gmail.com>wrote:


>
> Cool, thanks.
>
> Dunno why I didn't think to use blobs--I had forgotten there was a
> binary datatype, and encoded sequences as string representations of
> their base-5 (there's a fifth option for a base value, "N", meaning
> "unknown") value. Seemed like a silly optimization at the time, until
> I saw my quota usage. ;)
>
> Even if indices are more compact than tables *per* *record*, the
> implicit encoding of (chromosome, plusminus) in the kind name should
> save space over combining them into a single index (unless the
> overhead of creating a new table is itself enormous, which it doesn't
> appear to be).


True. When I answered the question intially, I thought you were discussing
using an Entity kind with indices, vs using a new Entity kind _as_ an index.


>
>
> Granted, all of these are seemingly silly, overaggressive
> optimizations, but in my real app, my quota usage is at least an order
> of magnitude over the size of the uploaded data themselves.


I would definitely recommend plucking the low-hanging fruit - like not
indexing the sequence itself, and combining plusminus and chromosome, and
shortening kind and field names, first. If that doesn't save enough,
consider splitting into multiple kinds as a last resort. Bear in mind that
you are always going to have significant encoding overhead, though - that's
the price you pay for indexing.


>
>
> Final question: Is there a faster way to figure out database usage
> than waiting 24 hours for the quota page on the dashboard to update?
> The cycle of coding+uploading data+waiting for the dashboard is not a
> rapid development iteration.


Unfortunately, not currently. You can estimate it by looking at the encoded
Protocol Buffers, though.

-Nick Johnson


>
>
> Thanks for all the help.
>
> Calvin
>
> On Jul 27, 10:09 am, "Nick Johnson (Google)" <nick.john...@google.com>
> wrote:
> > On Mon, Jul 27, 2009 at 5:57 PM, Calvin Slayden <
> calvin.slay...@gmail.com>wrote:
> >
> >
> >
> > > Nick,
> >
> > > First, thanks for the prompt reply (especially since my question is
> > > essentially about how to keep my app engine bill down!).
> >
> > > So the names are serialized with each record? In my example above,
> > > "int1" requires an additional 8 bytes * [number of records] for
> > > storage?  (Assuming UTF-16 here.)
> >
> > Actually, all strings are stored as utf-8. The protocol buffer format
> > requires at least one byte length prefix, plus another byte to indicate
> the
> > field ID.
> >
> > The reason this is necessary is because App Engine doesn't impose a
> schema
> > on your data, so it has to store the mapping of names to values with each
> > Entity stored.
> >
> > If you want a detailed understanding of how data is stored, Ryan
> Barrett's
> > 'under the covers' talk is a good start:
> http://sites.google.com/site/io/under-the-covers-of-the-google-app-en....
> > Reading the publically available documentation on how Protocol Buffers
> > are
> > encoded is also a useful thing to do.
> >
> > > Wow. (If it's not obvious, I have an
> > > application with a very large number of very small records, so fixed
> > > per-record overhead is fairly costly.)
> >
> > > I suspected indices were the principle cost (and that's entirely
> > > reasonable). But if you don't mind, I have a couple of more in-depth
> > > questions.
> >
> > > First, background (this is closer to my actual application). Assume I
> > > have a database of genome subsequences, where each record contains a
> > > sequence, a position, and a chromosome number. My current schema looks
> > > like this:
> >
> > > class Record(db.Model):
> > >   seq = db.StringProperty()
> >
> > If you don't need this field indexed, set indexed=False, or make it a
> > TextProperty. If you're further concerned about space, you could even
> define
> > your own Property subclass that stores the sequence as a blob, using only
> 2
> > bits per base, and encodes and decodes automatically.
> >
> >
> >
> > >   location = db.IntegerProperty(indexed=True)
> > >   chromosome = db.IntegerProperty(indexed=True)
> > >   plusminus = db.BooleanProperty(indexed=True)
> >
> > > Now, I query on location and chromosome, so both of those are
> > > currently indexed. But chromosome is only one of, say, five values.
> >
> > > So my question is, are indices equivalent--in terms of space--to a
> > > table of the property being indexed and the key of the record?
> >
> > Indices are more compact - both in terms of encoding, and because storing
> an
> > Entity with any indexed fields also results in the insertion of values
> into
> > the automatic indexes.
> >
> >
> >
> > > To expand, is the above (current) schema less space efficient than
> > > creating five separate tables, implicitly encoding the chromosome
> > > value in the table name?
> >
> > > e.g.:
> >
> > > class RecordChromosome1(db.Model):
> > >    seq = db.StringProperty()
> > >    location = db.IntegerProperty()
> >
> > This is certainly an alternative option, especially if you don't need any
> > custom indexes. It will save some space, though I'm not sure if it will
> be
> > significant.
> >
> > You can even use Python's dynamic nature to facilitate this:
> >
> > ---
> > def generateChromosomeModel(n):
> >   class Record(db.Model):
> >     @staticmethod
> >     def kind():
> >       return "c%d" % n
> >     seq = db.StringProperty()
> >     location = db.IntegerProperty()
> >   return Record
> >
> > RecordChromosome1, RecordChromosome2, RecordChromosome3,
> RecordChromosome4,
> > RecordChromosome5 = [generateChromosomeModel(x) for x in range(1, 6)]
> > ---
> >
> > I wouldn't recommend trying this until you've tried other optimisations
> and
> > determined that they're not effective enough, though.
> >
> >
> >
> > > In fact, it's entirely feasible to manually "unroll" (like loop
> > > unrolling, I guess?) the query combinatorics of the chromosome and
> > > plusminus properties and create ten different tables, wherein each
> > > record contains only the sequence and the location (but not the
> > > plusminus and chromosome properties, which are now implicitly encoded
> > > in the record's kind).
> >
> > > Would doing this save space by obviating the indices for "plusminus"
> > > and "location"?
> >
> > It would. Again, I'm not certain how significant these savings would be.
> A
> > simpler optimisation would be to combine "plusminus" and "location" by
> > having a single location value that varies from -5 to 5, for example.
> >
> >
> >
> > > And, while I'm at it, is the kind name also encoded in each record
> > > instance, or only the property names?
> >
> > It is. See above for how to change the kind name without changing the
> class
> > name.
> >
> > -Nick Johnson
> >
> >
> >
> >
> >
> >
> >
> > > Thanks again,
> > > Calvin
> >
> > > On Jul 27, 7:12 am, "Nick Johnson (Google)" <nick.john...@google.com>
> > > wrote:
> > > > Hi Calvin,
> >
> > > > On Sun, Jul 26, 2009 at 6:36 PM, Calvin Slayden <
> > > calvin.slay...@gmail.com>wrote:
> >
> > > > > Hello,
> >
> > > > > After noticing my app using far more of the datastore quota than I
> > > > > expected, I conducted a little experiment with a toy app. The toy
> > > > > creates records of the format:
> >
> > > > > class Record(db.Model):
> > > > >    int1 = db.IntegerProperty()
> > > > >    # ... repeated three more times
> >
> > > > > According to the App Engine documentation, IntegerProperty types in
> > > > > the datastore are stored as 64-bit integers, so each record
> consists
> > > > > of 8 * 4 = 32 bytes of "payload", plus overhead (key size * number
> of
> > > > > records, etc). I was hoping, with this experiment, to gain some
> > > > > insight into what that overhead costs.
> >
> > > > > I uploaded 141,725 (sorry it's not a round number) of these records
> > > > > and waited a day for my quota to update; it now shows me using
> 0.09GB
> > > > > of my stored data quota. (In comparison, the amount of "payload"
> data
> > > > > in the store is only 141,725 x 32 = 4,535,200 bytes, or ~0.004GB.)
> >
> > > > > I'm not sure whether it's best to compute overhead-per-property or
> > > > > overhead-per-record (there is obviously a nontrivial cost to the
> > > > > former, of course), and determining which is actually using the
> extra
> > > > > quota would probably take more effort than it's worth, so I figured
> > > > > I'd ask here.
> >
> > > > > Anyway, in terms of overhead, I'm observing (in this case)
> > > > > approximately 600 bytes per record (in addition to the size of the
> > > > > record's payload), or 150 bytes per property. In my particular case
> > > > > (though I doubt this ratio is anything resembling constant), I'm
> > > > > seeing 19 bytes of quota used for every byte of payload.
> >
> > > > > So I guess I have two big questions.
> >
> > > > > First, quota obviously measures something resembling the total size
> of
> > > > > the datastore tables created by the user (their size on disk?),
> which,
> > > > > due to indexing and such, is obviously going to be larger than the
> > > > > size of the payload alone. Right?
> >
> > > > More or less, yes. The total size includes the size of the serialized
> > > Entity
> > > > Protocol Buffer (which includes all the fields' values, their names
> > > ('int1'
> > > > in your case), plus some encoding overhead), as well as several
> built-in
> > > > indexes, plus any custom indexes you've built.
> >
> > > > > Second, how can I minimize this? Is the primary cost the added
> > > > > indices, such that if I disable indexing for properties I won't be
> > > > > querying on, I'll save space? (Again, this is something I could
> > > > > experiment with, but at this point, why not just ask?) What's the
> > > > > overhead per record, regardless of indexing? Are there any other
> steps
> > > > > I can take to minimize my datastore quota usage?
> >
> > > > Disabling indexing for fields you don't require to be indexed is the
> > > > simplest and most effective step. You can also use shorter field
> names,
> > > to
> > > > reduce redundant storage requirements there, if you wish. To make
> things
> > > > more user friendly, you can specify a name in your model:
> >
> > > >   class MyModel(db.Model):
> > > >     areallylongpropertyname = db.IntegerProperty(name='a')
> >
> > > > But bear in mind that you'll have to use the shortened names when
> > > executing
> > > > queries.
> >
> > > > -Nick Johnson
> >
> > > > > (The application I'm actually developing stores a few hundreds of
> > > > > megabytes of data in three tables--two of them only necessary
> because
> > > > > of the limitations on the types of queries datastore can
> handle--and
> > > > > while when I was developing the application I estimated that we
> would
> > > > > be able to stay well within an affordable amount of quota usage, it
> > > > > now appears that we would be using something like 200GB of
> > > > > datastore.)
> >
> > > > > Thanks,
> > > > > Calvin
> >
> > > > --
> > > > Nick Johnson, Developer Programs Engineer, App Engine
> >
> > --
> > Nick Johnson, Developer Programs Engineer, App Engine
> >
>

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en
-~----------~----~----~----~------~----~------~--~---

[google-appengine] Re: Datastore quota usage far exceeds sum size of items in datastore

Reply via email to