On Mon, Jul 27, 2009 at 6:31 PM, Calvin Slayden <calvin.slay...@gmail.com>wrote:
> > Cool, thanks. > > Dunno why I didn't think to use blobs--I had forgotten there was a > binary datatype, and encoded sequences as string representations of > their base-5 (there's a fifth option for a base value, "N", meaning > "unknown") value. Seemed like a silly optimization at the time, until > I saw my quota usage. ;) > > Even if indices are more compact than tables *per* *record*, the > implicit encoding of (chromosome, plusminus) in the kind name should > save space over combining them into a single index (unless the > overhead of creating a new table is itself enormous, which it doesn't > appear to be). True. When I answered the question intially, I thought you were discussing using an Entity kind with indices, vs using a new Entity kind _as_ an index. > > > Granted, all of these are seemingly silly, overaggressive > optimizations, but in my real app, my quota usage is at least an order > of magnitude over the size of the uploaded data themselves. I would definitely recommend plucking the low-hanging fruit - like not indexing the sequence itself, and combining plusminus and chromosome, and shortening kind and field names, first. If that doesn't save enough, consider splitting into multiple kinds as a last resort. Bear in mind that you are always going to have significant encoding overhead, though - that's the price you pay for indexing. > > > Final question: Is there a faster way to figure out database usage > than waiting 24 hours for the quota page on the dashboard to update? > The cycle of coding+uploading data+waiting for the dashboard is not a > rapid development iteration. Unfortunately, not currently. You can estimate it by looking at the encoded Protocol Buffers, though. -Nick Johnson > > > Thanks for all the help. > > Calvin > > On Jul 27, 10:09 am, "Nick Johnson (Google)" <nick.john...@google.com> > wrote: > > On Mon, Jul 27, 2009 at 5:57 PM, Calvin Slayden < > calvin.slay...@gmail.com>wrote: > > > > > > > > > Nick, > > > > > First, thanks for the prompt reply (especially since my question is > > > essentially about how to keep my app engine bill down!). > > > > > So the names are serialized with each record? In my example above, > > > "int1" requires an additional 8 bytes * [number of records] for > > > storage? (Assuming UTF-16 here.) > > > > Actually, all strings are stored as utf-8. The protocol buffer format > > requires at least one byte length prefix, plus another byte to indicate > the > > field ID. > > > > The reason this is necessary is because App Engine doesn't impose a > schema > > on your data, so it has to store the mapping of names to values with each > > Entity stored. > > > > If you want a detailed understanding of how data is stored, Ryan > Barrett's > > 'under the covers' talk is a good start: > http://sites.google.com/site/io/under-the-covers-of-the-google-app-en.... > > Reading the publically available documentation on how Protocol Buffers > > are > > encoded is also a useful thing to do. > > > > > Wow. (If it's not obvious, I have an > > > application with a very large number of very small records, so fixed > > > per-record overhead is fairly costly.) > > > > > I suspected indices were the principle cost (and that's entirely > > > reasonable). But if you don't mind, I have a couple of more in-depth > > > questions. > > > > > First, background (this is closer to my actual application). Assume I > > > have a database of genome subsequences, where each record contains a > > > sequence, a position, and a chromosome number. My current schema looks > > > like this: > > > > > class Record(db.Model): > > > seq = db.StringProperty() > > > > If you don't need this field indexed, set indexed=False, or make it a > > TextProperty. If you're further concerned about space, you could even > define > > your own Property subclass that stores the sequence as a blob, using only > 2 > > bits per base, and encodes and decodes automatically. > > > > > > > > > location = db.IntegerProperty(indexed=True) > > > chromosome = db.IntegerProperty(indexed=True) > > > plusminus = db.BooleanProperty(indexed=True) > > > > > Now, I query on location and chromosome, so both of those are > > > currently indexed. But chromosome is only one of, say, five values. > > > > > So my question is, are indices equivalent--in terms of space--to a > > > table of the property being indexed and the key of the record? > > > > Indices are more compact - both in terms of encoding, and because storing > an > > Entity with any indexed fields also results in the insertion of values > into > > the automatic indexes. > > > > > > > > > To expand, is the above (current) schema less space efficient than > > > creating five separate tables, implicitly encoding the chromosome > > > value in the table name? > > > > > e.g.: > > > > > class RecordChromosome1(db.Model): > > > seq = db.StringProperty() > > > location = db.IntegerProperty() > > > > This is certainly an alternative option, especially if you don't need any > > custom indexes. It will save some space, though I'm not sure if it will > be > > significant. > > > > You can even use Python's dynamic nature to facilitate this: > > > > --- > > def generateChromosomeModel(n): > > class Record(db.Model): > > @staticmethod > > def kind(): > > return "c%d" % n > > seq = db.StringProperty() > > location = db.IntegerProperty() > > return Record > > > > RecordChromosome1, RecordChromosome2, RecordChromosome3, > RecordChromosome4, > > RecordChromosome5 = [generateChromosomeModel(x) for x in range(1, 6)] > > --- > > > > I wouldn't recommend trying this until you've tried other optimisations > and > > determined that they're not effective enough, though. > > > > > > > > > In fact, it's entirely feasible to manually "unroll" (like loop > > > unrolling, I guess?) the query combinatorics of the chromosome and > > > plusminus properties and create ten different tables, wherein each > > > record contains only the sequence and the location (but not the > > > plusminus and chromosome properties, which are now implicitly encoded > > > in the record's kind). > > > > > Would doing this save space by obviating the indices for "plusminus" > > > and "location"? > > > > It would. Again, I'm not certain how significant these savings would be. > A > > simpler optimisation would be to combine "plusminus" and "location" by > > having a single location value that varies from -5 to 5, for example. > > > > > > > > > And, while I'm at it, is the kind name also encoded in each record > > > instance, or only the property names? > > > > It is. See above for how to change the kind name without changing the > class > > name. > > > > -Nick Johnson > > > > > > > > > > > > > > > > > Thanks again, > > > Calvin > > > > > On Jul 27, 7:12 am, "Nick Johnson (Google)" <nick.john...@google.com> > > > wrote: > > > > Hi Calvin, > > > > > > On Sun, Jul 26, 2009 at 6:36 PM, Calvin Slayden < > > > calvin.slay...@gmail.com>wrote: > > > > > > > Hello, > > > > > > > After noticing my app using far more of the datastore quota than I > > > > > expected, I conducted a little experiment with a toy app. The toy > > > > > creates records of the format: > > > > > > > class Record(db.Model): > > > > > int1 = db.IntegerProperty() > > > > > # ... repeated three more times > > > > > > > According to the App Engine documentation, IntegerProperty types in > > > > > the datastore are stored as 64-bit integers, so each record > consists > > > > > of 8 * 4 = 32 bytes of "payload", plus overhead (key size * number > of > > > > > records, etc). I was hoping, with this experiment, to gain some > > > > > insight into what that overhead costs. > > > > > > > I uploaded 141,725 (sorry it's not a round number) of these records > > > > > and waited a day for my quota to update; it now shows me using > 0.09GB > > > > > of my stored data quota. (In comparison, the amount of "payload" > data > > > > > in the store is only 141,725 x 32 = 4,535,200 bytes, or ~0.004GB.) > > > > > > > I'm not sure whether it's best to compute overhead-per-property or > > > > > overhead-per-record (there is obviously a nontrivial cost to the > > > > > former, of course), and determining which is actually using the > extra > > > > > quota would probably take more effort than it's worth, so I figured > > > > > I'd ask here. > > > > > > > Anyway, in terms of overhead, I'm observing (in this case) > > > > > approximately 600 bytes per record (in addition to the size of the > > > > > record's payload), or 150 bytes per property. In my particular case > > > > > (though I doubt this ratio is anything resembling constant), I'm > > > > > seeing 19 bytes of quota used for every byte of payload. > > > > > > > So I guess I have two big questions. > > > > > > > First, quota obviously measures something resembling the total size > of > > > > > the datastore tables created by the user (their size on disk?), > which, > > > > > due to indexing and such, is obviously going to be larger than the > > > > > size of the payload alone. Right? > > > > > > More or less, yes. The total size includes the size of the serialized > > > Entity > > > > Protocol Buffer (which includes all the fields' values, their names > > > ('int1' > > > > in your case), plus some encoding overhead), as well as several > built-in > > > > indexes, plus any custom indexes you've built. > > > > > > > Second, how can I minimize this? Is the primary cost the added > > > > > indices, such that if I disable indexing for properties I won't be > > > > > querying on, I'll save space? (Again, this is something I could > > > > > experiment with, but at this point, why not just ask?) What's the > > > > > overhead per record, regardless of indexing? Are there any other > steps > > > > > I can take to minimize my datastore quota usage? > > > > > > Disabling indexing for fields you don't require to be indexed is the > > > > simplest and most effective step. You can also use shorter field > names, > > > to > > > > reduce redundant storage requirements there, if you wish. To make > things > > > > more user friendly, you can specify a name in your model: > > > > > > class MyModel(db.Model): > > > > areallylongpropertyname = db.IntegerProperty(name='a') > > > > > > But bear in mind that you'll have to use the shortened names when > > > executing > > > > queries. > > > > > > -Nick Johnson > > > > > > > (The application I'm actually developing stores a few hundreds of > > > > > megabytes of data in three tables--two of them only necessary > because > > > > > of the limitations on the types of queries datastore can > handle--and > > > > > while when I was developing the application I estimated that we > would > > > > > be able to stay well within an affordable amount of quota usage, it > > > > > now appears that we would be using something like 200GB of > > > > > datastore.) > > > > > > > Thanks, > > > > > Calvin > > > > > > -- > > > > Nick Johnson, Developer Programs Engineer, App Engine > > > > -- > > Nick Johnson, Developer Programs Engineer, App Engine > > > --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Google App Engine" group. To post to this group, send email to google-appengine@googlegroups.com To unsubscribe from this group, send email to google-appengine+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en -~----------~----~----~----~------~----~------~--~---