Re: [sqlalchemy] Slow entities pickling for caching purposes

Mike Bayer Thu, 10 Oct 2019 10:34:03 -0700


On Thu, Oct 10, 2019, at 2:48 AM, Александр Егоров wrote:
> Hello!
> 
> I want to cache query results in a distributed cache (Redis), so I need to 
> serialize/deserialize fetched entities very fast.
> However, it turned that SQLAlchemy entities are very heavy for pickle to 
> dump/load. Regular classes and dicts with the same structure takes 
> significantly less amount of time to reconstruct from bytes.
> It's even slower than fetching and mapping the entities from a database.
> 
> Here are benchmark results, loading 10000 simple entities:
> 
> *class *Entity(Base):
>     __tablename__ = *'test_entity'**
**    *id = sa.Column(sa.Integer, primary_key=*True*, autoincrement=*True*)
>     field1 = sa.Column(sa.String)
>     field2 = sa.Column(sa.Integer)
> 
> 
> Name (time in ms)                      Median            StdDev           
> Dump bytes
> --------------------------------------------------------------------------------------------------------
> test_unpickle_dicts                    2.3764 (1.0)      0.0797 (1.09)      
> 337806
> test_unpickle_dataclasses              9.2546 (3.89)     0.0734 (1.0)       
> 437861
> test_unpickle_classes                  9.4188 (3.96)     0.1230 (1.68)      
> 437862
> test_load_from_database               91.3482 (38.44)    0.6686 (9.10)
> test_unpickle_unsaved_entities       108.6726 (45.73)    0.5154 (7.02)     
> 1448169
> test_unpickle_persisted_entities     166.4531 (70.04)    0.3787 (5.16)     
> 1697849
> ---------------------------------------------------------------------------------------------------------
> 
> *Environment*:
> OS: Ubuntu 16.04
> Python: CPython 3.6.3
> SQLAlchemy: 1.3.0
> Database: in-memory SQLite, PostgreSQL (all gives similar results)
> The full benchmark code is in a single attachment file.
> To run: 
> pytest test_ser.py --benchmark-columns=median,stddev --benchmark-disable-gc
> Requirements:
> pip install sqlalchemy==1.3.0 pytest==3.6.0 pytest-benchmark==3.1.1 
> dataclasses==0.6
> 
> *Questions:*
> 1) What strategy for caching I should consider while using SQLAlchemy? 
> Currently, the only option I see is to have a duplicated declaration of 
> entities in a form of simple classes and use it when I don't need 
> modification. Needles to say, it's a lot of code duplication.
> 2) Is it possible to have a read-only loader in SQLAlchemy? I can imagine 
> mapping loaded data to an entity class but without instrumentation overhead 
> which is not needed to read the data.
> If continue the idea further, I can imagine a situation when I load the 
> entity for modification, and it has a read-only relationship to another 
> entity. So that I don't pay for things I don't use.
> 
> Probably I'm making some crazy things and all wrong, feel free to point me 
> out. My ultimate goal is to manage caching with SQLAlchemy and understand 
> SQLAlchemy more in this direction.
> Thank you in advance!


Hi there -

there's no simple answer for this, have you considered writing your own 
serializer for your objects rather than using pickle's default strategy? that 
is, if you implement __getstate__ and __setstate__, you can allow the 
serialization to not include the "_sa_instance_state" that's associated with 
the objects which is where you'd get some of the slowness on dumping / loading, 
it also would save lots of memory / network overhead by not storing lots of 
data that you can probably forego for storing in your cache.

For bringing the classes back, you can implement a setstate that returns them 
as non-ORM classes, or patches them in as ORM classes.

I've altered your test to include only ORM mapped classes, comparing your 
baseline to one that uses __getstate__, and then several variants which will 
grant more or less ORM functionality once you have unpickled them. the very 
fastest one which omits any ORM functionality at all, and looks like what you 
propose, has a mean of 26 ms compared to 380 ms for the most expensive. 

there's one ORM-level guarantee you would want me to add for this most 
optimized version, that is if you have an ORM object that has no 
_sa_instance_state at all, you can still access attributes on it. This works 
but I would want to add test support so that I keep this as a supported use 
case.








> 

> --
>  SQLAlchemy - 
>  The Python SQL Toolkit and Object Relational Mapper
> 
> http://www.sqlalchemy.org/
> 
>  To post example code, please provide an MCVE: Minimal, Complete, and 
> Verifiable Example. See http://stackoverflow.com/help/mcve for a full 
> description.
>  --- 
>  You received this message because you are subscribed to the Google Groups 
> "sqlalchemy" group.
>  To unsubscribe from this group and stop receiving emails from it, send an 
> email to sqlalchemy+unsubscr...@googlegroups.com.
>  To view this discussion on the web visit 
> https://groups.google.com/d/msgid/sqlalchemy/75aa92b7-3c94-4b5a-afed-062a203792fd%40googlegroups.com
>  
> <https://groups.google.com/d/msgid/sqlalchemy/75aa92b7-3c94-4b5a-afed-062a203792fd%40googlegroups.com?utm_medium=email&utm_source=footer>.
> 
> 
> *Attachments:*
>  * test_sqla_pickle.py

-- 
SQLAlchemy - 
The Python SQL Toolkit and Object Relational Mapper

http://www.sqlalchemy.org/

To post example code, please provide an MCVE: Minimal, Complete, and Verifiable 
Example.  See  http://stackoverflow.com/help/mcve for a full description.
--- 
You received this message because you are subscribed to the Google Groups 
"sqlalchemy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to sqlalchemy+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/sqlalchemy/b0deac49-cc34-4be2-9129-e4c478726147%40www.fastmail.com.

import pickle

import pytest

import sqlalchemy as sa
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import make_transient_to_detached
from sqlalchemy.orm import state as orm_state

Base = declarative_base()


class Entity(Base):
    __tablename__ = "test_entity"

    id = sa.Column(sa.Integer, primary_key=True, autoincrement=True)

    field1 = sa.Column(sa.String)
    field2 = sa.Column(sa.Integer)

    def __init__(self, id_: int, field1: str, field2: int):
        self.id = id_
        self.field1 = field1
        self.field2 = field2

    def some_method(self) -> str:
        return self.field1 + str(self.field2)


class EntityWGetState(Entity):
    """an entity with custom __getstate__ / __setstate__.

    By having a __getstate__ / __setstate__ that completely omits the
    _sa_instance_state, pickling is reduced to a very small level.
    The attributes which are present will be readable.

    However, you will not get:

    1. ability to add objects to the Session
    2. relationship loading
    3. lazy attribute loading
    4. ability to set new values without errors (unless you assign to __dict__
       directly), because there's no state tracker and the descriptors will
       fail.

    That is, you have a plain object here.  The mapped attributes do have the
    InstrumentAttribute python descriptors, which for "__getattr__", only look
    in __dict__ first, so there's no issue with the InstanceState not being
    present.   This could be made a guarantee in SQLAlchemy, currently it's
    not.

    """

    def __init__(self, id_: int, field1: str, field2: int):
        self.id = id_
        self.field1 = field1
        self.field2 = field2

    def some_method(self) -> str:
        return self.field1 + str(self.field2)

    def __getstate__(self):
        d = sa.inspect(self).dict

        return {"id": d["id"], "field1": d["field1"]}

    def __setstate__(self, state):
        self.__dict__.update(state)


class EntityWHeavierSetState(EntityWGetState):
    """"Adds ORM state to the object.

    This object, when deserialized, will act as a full ORM object, which
    you can add to the session, supports lazy loading, etc.

    """

    def __setstate__(self, state):
        # this is private-ish API but if it were useful it could be made
        # public
        self._sa_instance_state = orm_state.InstanceState(
            self, self.__mapper__.class_manager
        )

        self.__dict__.update(state)

        # this is the part that's mostly expensive, years of effort have gone
        # into speeding it up
        make_transient_to_detached(self)


class EntityWOptimizedSetState(EntityWHeavierSetState):
    """"Adds ORM state to the object, but omits some things you might
    not need.

    Inlines the make_transient_to_detached() process and skips two of the
    three main steps which you might not need if you don't need to load
    expired attributes immediately, but you can make changes to this object
    and persist it.

    """

    def __setstate__(self, state):
        # this is private-ish API but if it were useful it could be made
        # public
        self._sa_instance_state = instance_state = orm_state.InstanceState(
            self, self.__mapper__.class_manager
        )

        self.__dict__.update(state)

        # set the primary key.  this you need for any kind of Session work.
        # there's some public API version of this also I'd have to find in
        # the docs.
        instance_state.key = instance_state.mapper._identity_key_from_state(
            instance_state
        )

        # unlikely to need this since the InstanceState is brand new,
        # unless you want the expired attributes part
        # state._commit_all(state.dict)

        # allows unloaded attributes to be loaded when you access them
        # state._expire_attributes(state.dict, state.unloaded_expirable)


@pytest.fixture(scope="session")
def dbengine():
    engine = sa.create_engine("sqlite://")
    Base.metadata.create_all(engine)
    yield engine


@pytest.fixture(scope="function")
def session(dbengine):
    session_factory = sa.orm.sessionmaker(bind=dbengine)
    Session = sa.orm.scoped_session(session_factory)
    yield Session
    Session.query(Entity).delete()
    Session.remove()


def generate_data():
    for i in range(10000):
        yield i + 1, str(i), i * 2


def test_unpickle_unsaved_entities(benchmark):
    entities = [Entity(*data) for data in generate_data()]
    dump_unsaved_entities = pickle.dumps(entities)
    print(f"\nDump unsaved entities bytes {len(dump_unsaved_entities)}")
    benchmark(pickle.loads, dump_unsaved_entities)


def test_unpickle_persisted_entities(session, benchmark):
    entities = [Entity(*data) for data in generate_data()]
    session.add_all(entities)
    session.flush()
    session.expunge_all()
    entities = session.query(Entity).all()
    dump_persisted_entities = pickle.dumps(entities)
    print(f"\nDump persisted entities bytes {len(dump_persisted_entities)}")
    benchmark(pickle.loads, dump_persisted_entities)


def test_unpickle_unsaved_entities_getstate(benchmark):
    entities = [EntityWGetState(*data) for data in generate_data()]
    dump_unsaved_entities = pickle.dumps(entities)
    print(f"\nDump unsaved entities bytes {len(dump_unsaved_entities)}")
    benchmark(pickle.loads, dump_unsaved_entities)

def test_unpickle_persisted_entities_getstate(session, benchmark):
    entities = [EntityWGetState(*data) for data in generate_data()]
    session.add_all(entities)
    session.flush()
    session.expunge_all()
    entities = session.query(EntityWGetState).all()
    dump_persisted_entities = pickle.dumps(entities)
    print(f"\nDump persisted entities bytes {len(dump_persisted_entities)}")
    benchmark(pickle.loads, dump_persisted_entities)


def test_unpickle_unsaved_entities_getstate_heavy(benchmark):
    entities = [EntityWHeavierSetState(*data) for data in generate_data()]
    dump_unsaved_entities = pickle.dumps(entities)
    print(f"\nDump unsaved entities bytes {len(dump_unsaved_entities)}")
    benchmark(pickle.loads, dump_unsaved_entities)


def test_unpickle_persisted_entities_getstate_heavy(session, benchmark):
    entities = [EntityWHeavierSetState(*data) for data in generate_data()]
    session.add_all(entities)
    session.flush()
    session.expunge_all()
    entities = session.query(EntityWHeavierSetState).all()
    dump_persisted_entities = pickle.dumps(entities)
    print(f"\nDump persisted entities bytes {len(dump_persisted_entities)}")
    benchmark(pickle.loads, dump_persisted_entities)

def test_unpickle_unsaved_entities_getstate_heavy_optimized(benchmark):
    entities = [EntityWOptimizedSetState(*data) for data in generate_data()]
    dump_unsaved_entities = pickle.dumps(entities)
    print(f"\nDump unsaved entities bytes {len(dump_unsaved_entities)}")
    benchmark(pickle.loads, dump_unsaved_entities)


def test_unpickle_persisted_entities_getstate_heavy_optimized(session, benchmark):
    entities = [EntityWOptimizedSetState(*data) for data in generate_data()]
    session.add_all(entities)
    session.flush()
    session.expunge_all()
    entities = session.query(EntityWOptimizedSetState).all()
    dump_persisted_entities = pickle.dumps(entities)
    print(f"\nDump persisted entities bytes {len(dump_persisted_entities)}")
    benchmark(pickle.loads, dump_persisted_entities)

Re: [sqlalchemy] Slow entities pickling for caching purposes

Reply via email to