Hey Mich,
my 2 cents on top of Jerry's.
For reusable fixtures across your tests, I'd leverage conftest.py and put
all of them there, if the number is not too big. Otherwise, as you say, you can
create tests/fixtures and place all of them there.
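
If you do go the tests/fixtures route, the conftest.py under tests can simply
import those fixtures so pytest picks them up, along these lines (module names
taken from your layout, just a sketch):

# tests/conftest.py
from tests.fixtures.extractHiveData import extractHiveData
from tests.fixtures.transformData import transformData
# ...one import per fixture module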

In terms of extractHiveData... as a fixture it is doing too much.
A fixture in pytest - anyone correct me if I'm wrong - is just an object you can
reuse across tests, something like the example below. It should contain very
minimal code, I'd say not more than 3 lines or so:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture
def spark():
    return SparkSession.builder.getOrCreate()

def test_mydataframe(spark):
    mydf = spark.table("mypreferredtable")
    assert mydf.count() > 0

It seems to me your extractHiveData is doing too much. IMHO it should be
something like:

import io
import pandas as pd
import pytest

@pytest.fixture
def hive_extractor():
    return <s>   # your Hive extraction helper module

@pytest.fixture
def default_config():
    return <a default instance of your config>

def test_extraction_from_hive(spark, hive_extractor, default_config):
    tableName = default_config['GCPVariables']['sourceTable']
    fullyQualifiedTableName = default_config['hiveVariables']['DSDB'] + '.' + tableName
    house_df = hive_extractor.loadTableFromHiveJDBC(spark, fullyQualifiedTableName)
    # To test your dataframe, do something like this
    test_df_pandas = pd.read_csv(io.StringIO("""regionname,col2,col3
Kensington and Chelsea,Value2,Value3"""))
    test_df = spark.createDataFrame(test_df_pandas)
    result_df = house_df.subtract(test_df)
    assert result_df.count() == 0
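
Or, skipping the pandas detour, you could build the expected rows straight in
Spark (a sketch; column names assumed to match your table):

    expected_df = spark.createDataFrame(
        [("Kensington and Chelsea", "Value2", "Value3")],
        schema=["regionname", "col2", "col3"])
    assert house_df.subtract(expected_df).count() == 0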

As always, please feel free to disagree... I haven't done much with pytest/
fixtures but this is how I'd restructure it.

hth
 Marco



On Tue, Feb 9, 2021 at 5:37 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Interesting points Jerry. I am not sure how much benefit atomising the unit
> tests brings.
>
> For example we have
>
> @pytest.fixture(scope = "session")
> def extractHiveData():
>     # read data through jdbc from Hive
>     spark_session = s.spark_session(ctest['common']['appName'])
>     tableName = config['GCPVariables']['sourceTable']
>     fullyQualifiedTableName = config['hiveVariables']['DSDB'] + '.' + tableName
>     house_df = s.loadTableFromHiveJDBC(spark_session, fullyQualifiedTableName)
>     # sample data: n rows from Kensington and Chelsea and n rows from City of Westminster
>     num_rows = int(ctest['statics']['read_df_rows']/2)
>     house_df = house_df.filter(col("regionname") == "Kensington and Chelsea").limit(num_rows). \
>         unionAll(house_df.filter(col("regionname") == "City of Westminster").limit(num_rows))
>     return house_df
>
> Notes:
>
> That spark_session is imported from a package and has been tested many
> times.
>
> The config static values are read through a python file config.py, which in
> turn reads a yml file config.yml.
>
> The important one to test is house_df, the data frame read from the Hive
> table. That can fail for a variety of reasons:
>
>
>    1. The Hive driver used is old or out of date
>    2. The Hive driver does not support kerberized access, which may be
>    required in production
>
> So any unit testing is going to be limited in scope. Another point is that if
> the extract data module fails, you are going to know that by calling it, and
> it can probably be rectified pretty quickly. It always comes down to coverage:
> how much testing needs to be covered.
>
>
> HTH
>
>
>
>
>
>
>
> On Tue, 9 Feb 2021 at 16:34, Jerry Vinokurov <grapesmo...@gmail.com>
> wrote:
>
>> Sure, I think it makes sense in many cases to break things up like this.
>> Looking at your other example I'd say that you might want to break up
>> extractHiveData into several fixtures (one for session, one for config, one
>> for the df) because in my experience fixtures like those are reused
>> constantly across a test suite. In general I try to keep my fixtures to one
>> concrete task only, so that if I find myself repeating a pattern I just
>> factor it out into another fixture.
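>>
>> A rough sketch of that split, reusing names from your fixture (untested; the
>> row-sampling step would stay in whichever fixture needs it):
>>
>> @pytest.fixture(scope = "session")
>> def spark_session():
>>     return s.spark_session(ctest['common']['appName'])
>>
>> @pytest.fixture(scope = "session")
>> def source_table():
>>     return config['hiveVariables']['DSDB'] + '.' + config['GCPVariables']['sourceTable']
>>
>> @pytest.fixture(scope = "session")
>> def house_df(spark_session, source_table):
>>     return s.loadTableFromHiveJDBC(spark_session, source_table)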
>>
>> On Tue, Feb 9, 2021 at 11:14 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Jerry for your comments.
>>>
>>> The easiest option, and I concur, is to have all these fixture files
>>> currently under the fixtures package lumped together in conftest.py under
>>> the tests package.
>>>
>>> Then you can do away with the fixtures package altogether and it works. However,
>>> I gather plug and play becomes less manageable when you have a large number
>>> of fixtures (large being relative here). My main modules (not tests) are
>>> designed to do ETL from any database that supports JDBC connections (bar
>>> Google BigQuery, which only works correctly with the Spark API). You specify your
>>> source DB and target DB in a yml file for any pluggable JDBC database.
>>>
>>> Going back to Pytest, please check this reference below for the reasoning
>>> behind packaging fixtures:
>>>
>>> How to modularize your py.test fixtures (github.com)
>>> <https://gist.github.com/peterhurford/09f7dcda0ab04b95c026c60fa49c2a68>
>>>
>>> With regard to your other point on fixtures (a fixture in each file), I
>>> have this fixture loadIntoMysqlTable() which uses the data frame
>>> created in extractHiveData, reads sample records from Hive and
>>> populates the MySql test table. The input needed is the dataframe that is
>>> constructed in the fixture module extractHiveData, which is passed to it as a
>>> parameter. This is the only way it seems to work through my tests:
>>>
>>>
>>> @pytest.fixture(scope = "session")
>>> def extractHiveData():
>>>     # read data through jdbc from Hive
>>>     spark_session = s.spark_session(ctest['common']['appName'])
>>>     tableName = config['GCPVariables']['sourceTable']
>>>     fullyQualifiedTableName = config['hiveVariables']['DSDB'] + '.' + tableName
>>>     house_df = s.loadTableFromHiveJDBC(spark_session, fullyQualifiedTableName)
>>>     # sample data: n rows from Kensington and Chelsea and n rows from City of Westminster
>>>     num_rows = int(ctest['statics']['read_df_rows']/2)
>>>     house_df = house_df.filter(col("regionname") == "Kensington and Chelsea").limit(num_rows). \
>>>         unionAll(house_df.filter(col("regionname") == "City of Westminster").limit(num_rows))
>>>     return house_df
>>>
>>> @pytest.fixture(scope = "session")
>>> def loadIntoMysqlTable(extractHiveData):
>>>     try:
>>>         extractHiveData. \
>>>             write. \
>>>             format("jdbc"). \
>>>             option("url", test_url). \
>>>             option("dbtable", ctest['statics']['sourceTable']). \
>>>             option("user", ctest['statics']['user']). \
>>>             option("password", ctest['statics']['password']). \
>>>             option("driver", ctest['statics']['driver']). \
>>>             mode(ctest['statics']['mode']). \
>>>             save()
>>>         return True
>>>     except Exception as e:
>>>         print(f"""{e}, quitting""")
>>>         sys.exit(1)
>>>
>>> Thanks again.
>>>
>>>
>>> Mich
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, 9 Feb 2021 at 15:47, Jerry Vinokurov <grapesmo...@gmail.com>
>>> wrote:
>>>
>>>> Hi Mich,
>>>>
>>>> I'm a bit confused by what you mean when you say that you cannot call a
>>>> fixture in another fixture. The fixtures resolve dependencies among
>>>> themselves by means of their named parameters. So that means that if I have
>>>> a fixture
>>>>
>>>> @pytest.fixture
>>>> def fixture1():
>>>>     return SomeObj()
>>>>
>>>> and another fixture
>>>>
>>>> @pytest.fixture
>>>> def fixture2(fixture1):
>>>>     return do_something_with_obj(fixture1)
>>>>
>>>> my second fixture will simply receive the object created by the first.
>>>> As such, you do not need to "call" the second fixture at all. Of course, if
>>>> you had some use case where you were constructing an object in the second
>>>> fixture, you could have the first return a class, or you could have it
>>>> return a function. In fact, I have fixtures in a project that do both. Here
>>>> they are:
>>>>
>>>> @pytest.fixture
>>>> def func():
>>>>     def foo(x, y, z):
>>>>         return (x + y) * z
>>>>     return foo
>>>>
>>>> That's a fixture that returns a function, and any test using the func
>>>> fixture would receive that actual function as a value, which could then be
>>>> invoked by calling e.g. func(1, 2, 3). Here's another fixture that's
>>>> more like what you're doing:
>>>>
>>>>
>>>> @pytest.fixture
>>>> def data_frame():
>>>>     return pd.DataFrame.from_records([(1, 2, 3), (4, 5, 6)], columns=['x', 'y', 'z'])
>>>>
>>>> This one just returns a data frame that can be operated on.
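>>>>
>>>> For instance, a test just takes those fixtures as parameters (quick sketch,
>>>> not from my project):
>>>>
>>>> def test_func(func):
>>>>     assert func(1, 2, 3) == 9
>>>>
>>>> def test_data_frame(data_frame):
>>>>     assert list(data_frame.columns) == ['x', 'y', 'z']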
>>>>
>>>> Looking at your setup, I don't want to say that it's wrong per se (it
>>>> could be very appropriate to your specific project to split things up among
>>>> these many files) but I would say that it's not idiomatic usage of pytest
>>>> fixtures, in my experience. It feels to me like you're jumping through a
>>>> lot of hoops to set up something that could be done quite easily and
>>>> compactly in conftest.py. I do want to emphasize that there is no
>>>> limitation on how fixtures can be used within functions or within other
>>>> fixtures (which are also just functions), since the result of the fixture
>>>> call is just some Python object.
>>>>
>>>> Hope this helps,
>>>> Jerry
>>>>
>>>> On Tue, Feb 9, 2021 at 10:18 AM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> I was a bit confused with the use of fixtures in Pytest with the
>>>>> dataframes passed as an input pipeline from one fixture to another. I 
>>>>> wrote
>>>>> this after spending some time on it. As usual it is heuristic rather than
>>>>> anything overtly by the book so to speak.
>>>>>
>>>>> In PySpark and PyCharm you can ETL from Hive to BigQuery or from
>>>>> Oracle to Hive etc. However, for PyTest, I decided to use MySql as the
>>>>> database of choice for testing, with a small sample of data (200 rows). I
>>>>> mentioned fixtures. Simply put, "Fixtures are functions, which will
>>>>> run before each test function to which they are applied, to prepare data.
>>>>> Fixtures are used to feed some data to the tests such as database
>>>>> connections". If you have an ordering like read data (Extract), do
>>>>> something with it (Transform) and save it somewhere (Load), using Spark,
>>>>> then these all happen in memory with data frames feeding each other.
>>>>>
>>>>> The crucial thing to remember is that fixtures pass functions to each
>>>>> other as parameters not by invoking them directly!
>>>>>
>>>>> Example:
>>>>>
>>>>> ## This is correct
>>>>> @pytest.fixture(scope = "session")
>>>>> def transformData(readSourceData):  ## fixture passed as parameter
>>>>>     # this is incorrect (cannot call a fixture in another fixture):
>>>>>     # read_df = readSourceData()
>>>>>     # so this operation becomes
>>>>>     transformation_df = readSourceData. \
>>>>>         select( \
>>>>>         ....
>>>>>
>>>>> Say in PyCharm under the tests package, you create a package "fixtures"
>>>>> (just a name, nothing to do with pytest "fixture") and in there you put your
>>>>> ETL python modules that prepare data for you. Example:
>>>>>
>>>>> ### file --> saveData.py
>>>>> @pytest.fixture(scope = "session")
>>>>> def saveData(transformData):
>>>>>     # Write to test target table
>>>>>     try:
>>>>>         transformData. \
>>>>>             write. \
>>>>>             format("jdbc"). \
>>>>>             ....
>>>>>
>>>>>
>>>>> You then drive this test by creating a file called conftest.py under the
>>>>> tests package. You can then make your fixture files available by
>>>>> referencing them in this file as below:
>>>>>
>>>>> import pytest
>>>>> from tests.fixtures.extractHiveData import extractHiveData
>>>>> from tests.fixtures.loadIntoMysqlTable import loadIntoMysqlTable
>>>>> from tests.fixtures.readSavedData import readSavedData
>>>>> from tests.fixtures.readSourceData import readSourceData
>>>>> from tests.fixtures.transformData import transformData
>>>>> from tests.fixtures.saveData import saveData
>>>>>
>>>>> Then you have your test Python file, say test_oracle.py, under the
>>>>> tests package, and put the assertions there:
>>>>>
>>>>> import pytest
>>>>> from src.config import ctest
>>>>>
>>>>> @pytest.mark.usefixtures("extractHiveData")
>>>>> def test_extract(extractHiveData):
>>>>>     assert extractHiveData.count() > 0
>>>>>
>>>>> @pytest.mark.usefixtures("loadIntoMysqlTable")
>>>>> def test_loadIntoMysqlTable(loadIntoMysqlTable):
>>>>>     assert loadIntoMysqlTable
>>>>>
>>>>> @pytest.mark.usefixtures("readSavedData")
>>>>> def test_readSourceData(readSourceData):
>>>>>     assert readSourceData.count() == ctest['statics']['read_df_rows']
>>>>>
>>>>> @pytest.mark.usefixtures("transformData")
>>>>> def test_transformData(transformData):
>>>>>     assert transformData.count() == ctest['statics']['transformation_df_rows']
>>>>>
>>>>> @pytest.mark.usefixtures("saveData")
>>>>> def test_saveData(saveData):
>>>>>     assert saveData
>>>>>
>>>>> @pytest.mark.usefixtures("readSavedData")
>>>>> def test_readSavedData(transformData, readSavedData):
>>>>>     assert readSavedData.subtract(transformData).count() == 0
>>>>>
>>>>> This is an illustration from PyCharm of the directory structure under
>>>>> tests:
>>>>>
>>>>>
>>>>> [image: PyCharm directory structure under the tests package]
>>>>>
>>>>>
>>>>> Let me know your thoughts.
>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>>
>>>>> Mich
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> http://www.google.com/profiles/grapesmoker
>>>>
>>>
>>
>> --
>> http://www.google.com/profiles/grapesmoker
>>
>
