[google-appengine] Re: Datastore quota usage far exceeds sum size of items in datastore

2009-07-27 Thread Nick Johnson (Google)
On Mon, Jul 27, 2009 at 6:31 PM, Calvin Slayden wrote:

>
> Cool, thanks.
>
> Dunno why I didn't think to use blobs--I had forgotten there was a
> binary datatype, and encoded sequences as string representations of
> their base-5 (there's a fifth option for a base value, "N", meaning
> "unknown") value. Seemed like a silly optimization at the time, until
> I saw my quota usage. ;)
>
> Even if indices are more compact than tables *per* *record*, the
> implicit encoding of (chromosome, plusminus) in the kind name should
> save space over combining them into a single index (unless the
> overhead of creating a new table is itself enormous, which it doesn't
> appear to be).


True. When I answered the question intially, I thought you were discussing
using an Entity kind with indices, vs using a new Entity kind _as_ an index.


>
>
> Granted, all of these are seemingly silly, overaggressive
> optimizations, but in my real app, my quota usage is at least an order
> of magnitude over the size of the uploaded data themselves.


I would definitely recommend plucking the low-hanging fruit - like not
indexing the sequence itself, and combining plusminus and chromosome, and
shortening kind and field names, first. If that doesn't save enough,
consider splitting into multiple kinds as a last resort. Bear in mind that
you are always going to have significant encoding overhead, though - that's
the price you pay for indexing.


>
>
> Final question: Is there a faster way to figure out database usage
> than waiting 24 hours for the quota page on the dashboard to update?
> The cycle of coding+uploading data+waiting for the dashboard is not a
> rapid development iteration.


Unfortunately, not currently. You can estimate it by looking at the encoded
Protocol Buffers, though.

-Nick Johnson


>
>
> Thanks for all the help.
>
> Calvin
>
> On Jul 27, 10:09 am, "Nick Johnson (Google)" 
> wrote:
> > On Mon, Jul 27, 2009 at 5:57 PM, Calvin Slayden <
> calvin.slay...@gmail.com>wrote:
> >
> >
> >
> > > Nick,
> >
> > > First, thanks for the prompt reply (especially since my question is
> > > essentially about how to keep my app engine bill down!).
> >
> > > So the names are serialized with each record? In my example above,
> > > "int1" requires an additional 8 bytes * [number of records] for
> > > storage?  (Assuming UTF-16 here.)
> >
> > Actually, all strings are stored as utf-8. The protocol buffer format
> > requires at least one byte length prefix, plus another byte to indicate
> the
> > field ID.
> >
> > The reason this is necessary is because App Engine doesn't impose a
> schema
> > on your data, so it has to store the mapping of names to values with each
> > Entity stored.
> >
> > If you want a detailed understanding of how data is stored, Ryan
> Barrett's
> > 'under the covers' talk is a good start:
> http://sites.google.com/site/io/under-the-covers-of-the-google-app-en
> > Reading the publically available documentation on how Protocol Buffers
> > are
> > encoded is also a useful thing to do.
> >
> > > Wow. (If it's not obvious, I have an
> > > application with a very large number of very small records, so fixed
> > > per-record overhead is fairly costly.)
> >
> > > I suspected indices were the principle cost (and that's entirely
> > > reasonable). But if you don't mind, I have a couple of more in-depth
> > > questions.
> >
> > > First, background (this is closer to my actual application). Assume I
> > > have a database of genome subsequences, where each record contains a
> > > sequence, a position, and a chromosome number. My current schema looks
> > > like this:
> >
> > > class Record(db.Model):
> > >   seq = db.StringProperty()
> >
> > If you don't need this field indexed, set indexed=False, or make it a
> > TextProperty. If you're further concerned about space, you could even
> define
> > your own Property subclass that stores the sequence as a blob, using only
> 2
> > bits per base, and encodes and decodes automatically.
> >
> >
> >
> > >   location = db.IntegerProperty(indexed=True)
> > >   chromosome = db.IntegerProperty(indexed=True)
> > >   plusminus = db.BooleanProperty(indexed=True)
> >
> > > Now, I query on location and chromosome, so both of those are
> > > currently indexed. But chromosome is only one of, say, five values.
> >
> > > So my question is, are indices equivalent--in terms of space--to a
> > > table of the property being indexed and the key of the record?
> >
> > Indices are more compact - both in terms of encoding, and because storing
> an
> > Entity with any indexed fields also results in the insertion of values
> into
> > the automatic indexes.
> >
> >
> >
> > > To expand, is the above (current) schema less space efficient than
> > > creating five separate tables, implicitly encoding the chromosome
> > > value in the table name?
> >
> > > e.g.:
> >
> > > class RecordChromosome1(db.Model):
> > >seq = db.StringProperty()
> > >location = db.IntegerProperty()
> >
> > This is certainly an alte

[google-appengine] Re: Datastore quota usage far exceeds sum size of items in datastore

2009-07-27 Thread Calvin Slayden

Cool, thanks.

Dunno why I didn't think to use blobs--I had forgotten there was a
binary datatype, and encoded sequences as string representations of
their base-5 (there's a fifth option for a base value, "N", meaning
"unknown") value. Seemed like a silly optimization at the time, until
I saw my quota usage. ;)

Even if indices are more compact than tables *per* *record*, the
implicit encoding of (chromosome, plusminus) in the kind name should
save space over combining them into a single index (unless the
overhead of creating a new table is itself enormous, which it doesn't
appear to be).

Granted, all of these are seemingly silly, overaggressive
optimizations, but in my real app, my quota usage is at least an order
of magnitude over the size of the uploaded data themselves.

Final question: Is there a faster way to figure out database usage
than waiting 24 hours for the quota page on the dashboard to update?
The cycle of coding+uploading data+waiting for the dashboard is not a
rapid development iteration.

Thanks for all the help.

Calvin

On Jul 27, 10:09 am, "Nick Johnson (Google)" 
wrote:
> On Mon, Jul 27, 2009 at 5:57 PM, Calvin Slayden 
> wrote:
>
>
>
> > Nick,
>
> > First, thanks for the prompt reply (especially since my question is
> > essentially about how to keep my app engine bill down!).
>
> > So the names are serialized with each record? In my example above,
> > "int1" requires an additional 8 bytes * [number of records] for
> > storage?  (Assuming UTF-16 here.)
>
> Actually, all strings are stored as utf-8. The protocol buffer format
> requires at least one byte length prefix, plus another byte to indicate the
> field ID.
>
> The reason this is necessary is because App Engine doesn't impose a schema
> on your data, so it has to store the mapping of names to values with each
> Entity stored.
>
> If you want a detailed understanding of how data is stored, Ryan Barrett's
> 'under the covers' talk is a good 
> start:http://sites.google.com/site/io/under-the-covers-of-the-google-app-en
> Reading the publically available documentation on how Protocol Buffers
> are
> encoded is also a useful thing to do.
>
> > Wow. (If it's not obvious, I have an
> > application with a very large number of very small records, so fixed
> > per-record overhead is fairly costly.)
>
> > I suspected indices were the principle cost (and that's entirely
> > reasonable). But if you don't mind, I have a couple of more in-depth
> > questions.
>
> > First, background (this is closer to my actual application). Assume I
> > have a database of genome subsequences, where each record contains a
> > sequence, a position, and a chromosome number. My current schema looks
> > like this:
>
> > class Record(db.Model):
> >   seq = db.StringProperty()
>
> If you don't need this field indexed, set indexed=False, or make it a
> TextProperty. If you're further concerned about space, you could even define
> your own Property subclass that stores the sequence as a blob, using only 2
> bits per base, and encodes and decodes automatically.
>
>
>
> >   location = db.IntegerProperty(indexed=True)
> >   chromosome = db.IntegerProperty(indexed=True)
> >   plusminus = db.BooleanProperty(indexed=True)
>
> > Now, I query on location and chromosome, so both of those are
> > currently indexed. But chromosome is only one of, say, five values.
>
> > So my question is, are indices equivalent--in terms of space--to a
> > table of the property being indexed and the key of the record?
>
> Indices are more compact - both in terms of encoding, and because storing an
> Entity with any indexed fields also results in the insertion of values into
> the automatic indexes.
>
>
>
> > To expand, is the above (current) schema less space efficient than
> > creating five separate tables, implicitly encoding the chromosome
> > value in the table name?
>
> > e.g.:
>
> > class RecordChromosome1(db.Model):
> >    seq = db.StringProperty()
> >    location = db.IntegerProperty()
>
> This is certainly an alternative option, especially if you don't need any
> custom indexes. It will save some space, though I'm not sure if it will be
> significant.
>
> You can even use Python's dynamic nature to facilitate this:
>
> ---
> def generateChromosomeModel(n):
>   class Record(db.Model):
>     @staticmethod
>     def kind():
>       return "c%d" % n
>     seq = db.StringProperty()
>     location = db.IntegerProperty()
>   return Record
>
> RecordChromosome1, RecordChromosome2, RecordChromosome3, RecordChromosome4,
> RecordChromosome5 = [generateChromosomeModel(x) for x in range(1, 6)]
> ---
>
> I wouldn't recommend trying this until you've tried other optimisations and
> determined that they're not effective enough, though.
>
>
>
> > In fact, it's entirely feasible to manually "unroll" (like loop
> > unrolling, I guess?) the query combinatorics of the chromosome and
> > plusminus properties and create ten different tables, wherein each
> > record contains only the sequence and the loc

[google-appengine] Re: Datastore quota usage far exceeds sum size of items in datastore

2009-07-27 Thread Nick Johnson (Google)
On Mon, Jul 27, 2009 at 5:57 PM, Calvin Slayden wrote:

>
> Nick,
>
> First, thanks for the prompt reply (especially since my question is
> essentially about how to keep my app engine bill down!).
>
> So the names are serialized with each record? In my example above,
> "int1" requires an additional 8 bytes * [number of records] for
> storage?  (Assuming UTF-16 here.)


Actually, all strings are stored as utf-8. The protocol buffer format
requires at least one byte length prefix, plus another byte to indicate the
field ID.

The reason this is necessary is because App Engine doesn't impose a schema
on your data, so it has to store the mapping of names to values with each
Entity stored.

If you want a detailed understanding of how data is stored, Ryan Barrett's
'under the covers' talk is a good start:
http://sites.google.com/site/io/under-the-covers-of-the-google-app-engine-datastore.
Reading the publically available documentation on how Protocol Buffers
are
encoded is also a useful thing to do.


> Wow. (If it's not obvious, I have an
> application with a very large number of very small records, so fixed
> per-record overhead is fairly costly.)
>
> I suspected indices were the principle cost (and that's entirely
> reasonable). But if you don't mind, I have a couple of more in-depth
> questions.
>
> First, background (this is closer to my actual application). Assume I
> have a database of genome subsequences, where each record contains a
> sequence, a position, and a chromosome number. My current schema looks
> like this:
>
> class Record(db.Model):
>   seq = db.StringProperty()


If you don't need this field indexed, set indexed=False, or make it a
TextProperty. If you're further concerned about space, you could even define
your own Property subclass that stores the sequence as a blob, using only 2
bits per base, and encodes and decodes automatically.


>
>   location = db.IntegerProperty(indexed=True)
>   chromosome = db.IntegerProperty(indexed=True)
>   plusminus = db.BooleanProperty(indexed=True)
>
> Now, I query on location and chromosome, so both of those are
> currently indexed. But chromosome is only one of, say, five values.
>
> So my question is, are indices equivalent--in terms of space--to a
> table of the property being indexed and the key of the record?


Indices are more compact - both in terms of encoding, and because storing an
Entity with any indexed fields also results in the insertion of values into
the automatic indexes.


>
> To expand, is the above (current) schema less space efficient than
> creating five separate tables, implicitly encoding the chromosome
> value in the table name?
>
> e.g.:
>
> class RecordChromosome1(db.Model):
>seq = db.StringProperty()
>location = db.IntegerProperty()


This is certainly an alternative option, especially if you don't need any
custom indexes. It will save some space, though I'm not sure if it will be
significant.

You can even use Python's dynamic nature to facilitate this:

---
def generateChromosomeModel(n):
  class Record(db.Model):
@staticmethod
def kind():
  return "c%d" % n
seq = db.StringProperty()
location = db.IntegerProperty()
  return Record

RecordChromosome1, RecordChromosome2, RecordChromosome3, RecordChromosome4,
RecordChromosome5 = [generateChromosomeModel(x) for x in range(1, 6)]
---

I wouldn't recommend trying this until you've tried other optimisations and
determined that they're not effective enough, though.


>
> In fact, it's entirely feasible to manually "unroll" (like loop
> unrolling, I guess?) the query combinatorics of the chromosome and
> plusminus properties and create ten different tables, wherein each
> record contains only the sequence and the location (but not the
> plusminus and chromosome properties, which are now implicitly encoded
> in the record's kind).
>
> Would doing this save space by obviating the indices for "plusminus"
> and "location"?


It would. Again, I'm not certain how significant these savings would be. A
simpler optimisation would be to combine "plusminus" and "location" by
having a single location value that varies from -5 to 5, for example.


>
>
> And, while I'm at it, is the kind name also encoded in each record
> instance, or only the property names?


It is. See above for how to change the kind name without changing the class
name.

-Nick Johnson


>
>
> Thanks again,
> Calvin
>
> On Jul 27, 7:12 am, "Nick Johnson (Google)" 
> wrote:
> > Hi Calvin,
> >
> > On Sun, Jul 26, 2009 at 6:36 PM, Calvin Slayden <
> calvin.slay...@gmail.com>wrote:
> >
> >
> >
> >
> >
> >
> >
> > > Hello,
> >
> > > After noticing my app using far more of the datastore quota than I
> > > expected, I conducted a little experiment with a toy app. The toy
> > > creates records of the format:
> >
> > > class Record(db.Model):
> > >int1 = db.IntegerProperty()
> > ># ... repeated three more times
> >
> > > According to the App Engine documentation, IntegerProperty types in
> > > the data

[google-appengine] Re: Datastore quota usage far exceeds sum size of items in datastore

2009-07-27 Thread Calvin Slayden

Nick,

First, thanks for the prompt reply (especially since my question is
essentially about how to keep my app engine bill down!).

So the names are serialized with each record? In my example above,
"int1" requires an additional 8 bytes * [number of records] for
storage?  (Assuming UTF-16 here.) Wow. (If it's not obvious, I have an
application with a very large number of very small records, so fixed
per-record overhead is fairly costly.)

I suspected indices were the principle cost (and that's entirely
reasonable). But if you don't mind, I have a couple of more in-depth
questions.

First, background (this is closer to my actual application). Assume I
have a database of genome subsequences, where each record contains a
sequence, a position, and a chromosome number. My current schema looks
like this:

class Record(db.Model):
   seq = db.StringProperty()
   location = db.IntegerProperty(indexed=True)
   chromosome = db.IntegerProperty(indexed=True)
   plusminus = db.BooleanProperty(indexed=True)

Now, I query on location and chromosome, so both of those are
currently indexed. But chromosome is only one of, say, five values.

So my question is, are indices equivalent--in terms of space--to a
table of the property being indexed and the key of the record?

To expand, is the above (current) schema less space efficient than
creating five separate tables, implicitly encoding the chromosome
value in the table name?

e.g.:

class RecordChromosome1(db.Model):
seq = db.StringProperty()
location = db.IntegerProperty()

In fact, it's entirely feasible to manually "unroll" (like loop
unrolling, I guess?) the query combinatorics of the chromosome and
plusminus properties and create ten different tables, wherein each
record contains only the sequence and the location (but not the
plusminus and chromosome properties, which are now implicitly encoded
in the record's kind).

Would doing this save space by obviating the indices for "plusminus"
and "location"?

And, while I'm at it, is the kind name also encoded in each record
instance, or only the property names?

Thanks again,
Calvin

On Jul 27, 7:12 am, "Nick Johnson (Google)" 
wrote:
> Hi Calvin,
>
> On Sun, Jul 26, 2009 at 6:36 PM, Calvin Slayden 
> wrote:
>
>
>
>
>
>
>
> > Hello,
>
> > After noticing my app using far more of the datastore quota than I
> > expected, I conducted a little experiment with a toy app. The toy
> > creates records of the format:
>
> > class Record(db.Model):
> >    int1 = db.IntegerProperty()
> >    # ... repeated three more times
>
> > According to the App Engine documentation, IntegerProperty types in
> > the datastore are stored as 64-bit integers, so each record consists
> > of 8 * 4 = 32 bytes of "payload", plus overhead (key size * number of
> > records, etc). I was hoping, with this experiment, to gain some
> > insight into what that overhead costs.
>
> > I uploaded 141,725 (sorry it's not a round number) of these records
> > and waited a day for my quota to update; it now shows me using 0.09GB
> > of my stored data quota. (In comparison, the amount of "payload" data
> > in the store is only 141,725 x 32 = 4,535,200 bytes, or ~0.004GB.)
>
> > I'm not sure whether it's best to compute overhead-per-property or
> > overhead-per-record (there is obviously a nontrivial cost to the
> > former, of course), and determining which is actually using the extra
> > quota would probably take more effort than it's worth, so I figured
> > I'd ask here.
>
> > Anyway, in terms of overhead, I'm observing (in this case)
> > approximately 600 bytes per record (in addition to the size of the
> > record's payload), or 150 bytes per property. In my particular case
> > (though I doubt this ratio is anything resembling constant), I'm
> > seeing 19 bytes of quota used for every byte of payload.
>
> > So I guess I have two big questions.
>
> > First, quota obviously measures something resembling the total size of
> > the datastore tables created by the user (their size on disk?), which,
> > due to indexing and such, is obviously going to be larger than the
> > size of the payload alone. Right?
>
> More or less, yes. The total size includes the size of the serialized Entity
> Protocol Buffer (which includes all the fields' values, their names ('int1'
> in your case), plus some encoding overhead), as well as several built-in
> indexes, plus any custom indexes you've built.
>
>
>
> > Second, how can I minimize this? Is the primary cost the added
> > indices, such that if I disable indexing for properties I won't be
> > querying on, I'll save space? (Again, this is something I could
> > experiment with, but at this point, why not just ask?) What's the
> > overhead per record, regardless of indexing? Are there any other steps
> > I can take to minimize my datastore quota usage?
>
> Disabling indexing for fields you don't require to be indexed is the
> simplest and most effective step. You can also use shorter field names, to
> reduce redundant storage requirement

[google-appengine] Re: Datastore quota usage far exceeds sum size of items in datastore

2009-07-27 Thread Nick Johnson (Google)
Hi Calvin,

On Sun, Jul 26, 2009 at 6:36 PM, Calvin Slayden wrote:

>
> Hello,
>
> After noticing my app using far more of the datastore quota than I
> expected, I conducted a little experiment with a toy app. The toy
> creates records of the format:
>
> class Record(db.Model):
>int1 = db.IntegerProperty()
># ... repeated three more times
>
> According to the App Engine documentation, IntegerProperty types in
> the datastore are stored as 64-bit integers, so each record consists
> of 8 * 4 = 32 bytes of "payload", plus overhead (key size * number of
> records, etc). I was hoping, with this experiment, to gain some
> insight into what that overhead costs.
>
> I uploaded 141,725 (sorry it's not a round number) of these records
> and waited a day for my quota to update; it now shows me using 0.09GB
> of my stored data quota. (In comparison, the amount of "payload" data
> in the store is only 141,725 x 32 = 4,535,200 bytes, or ~0.004GB.)
>
> I'm not sure whether it's best to compute overhead-per-property or
> overhead-per-record (there is obviously a nontrivial cost to the
> former, of course), and determining which is actually using the extra
> quota would probably take more effort than it's worth, so I figured
> I'd ask here.
>
> Anyway, in terms of overhead, I'm observing (in this case)
> approximately 600 bytes per record (in addition to the size of the
> record's payload), or 150 bytes per property. In my particular case
> (though I doubt this ratio is anything resembling constant), I'm
> seeing 19 bytes of quota used for every byte of payload.
>
> So I guess I have two big questions.
>
> First, quota obviously measures something resembling the total size of
> the datastore tables created by the user (their size on disk?), which,
> due to indexing and such, is obviously going to be larger than the
> size of the payload alone. Right?


More or less, yes. The total size includes the size of the serialized Entity
Protocol Buffer (which includes all the fields' values, their names ('int1'
in your case), plus some encoding overhead), as well as several built-in
indexes, plus any custom indexes you've built.


>
>
> Second, how can I minimize this? Is the primary cost the added
> indices, such that if I disable indexing for properties I won't be
> querying on, I'll save space? (Again, this is something I could
> experiment with, but at this point, why not just ask?) What's the
> overhead per record, regardless of indexing? Are there any other steps
> I can take to minimize my datastore quota usage?


Disabling indexing for fields you don't require to be indexed is the
simplest and most effective step. You can also use shorter field names, to
reduce redundant storage requirements there, if you wish. To make things
more user friendly, you can specify a name in your model:

  class MyModel(db.Model):
areallylongpropertyname = db.IntegerProperty(name='a')

But bear in mind that you'll have to use the shortened names when executing
queries.

-Nick Johnson


>
>
> (The application I'm actually developing stores a few hundreds of
> megabytes of data in three tables--two of them only necessary because
> of the limitations on the types of queries datastore can handle--and
> while when I was developing the application I estimated that we would
> be able to stay well within an affordable amount of quota usage, it
> now appears that we would be using something like 200GB of
> datastore.)
>
> Thanks,
> Calvin
>
> >
>


-- 
Nick Johnson, Developer Programs Engineer, App Engine

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to google-appengine@googlegroups.com
To unsubscribe from this group, send email to 
google-appengine+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en
-~--~~~~--~~--~--~---