Testing ETL with Spark using Pytest

Mich Talebzadeh Tue, 09 Feb 2021 07:17:56 -0800

I was a bit confused by how fixtures work in Pytest when DataFrames are
passed along a pipeline from one fixture to the next. I wrote this up after
spending some time on it. As usual it is heuristic rather than anything
strictly by the book, so to speak.


With PySpark in PyCharm you can run ETL from Hive to BigQuery, from Oracle to
Hive and so on. For Pytest, however, I decided to use MySQL as the database
for testing, with a small sample of data (200 rows). I mentioned
fixtures. Simply put, "Fixtures are functions, which will run before each
test function to which it is applied, to prepare data. Fixtures are used
to feed some data to the tests such as database connections". With
scope = "session", as in the examples here, a fixture runs only once and its
result is shared by every test that requests it. If you have an ordering
like read data (Extract), do something with it (Transform) and save it
somewhere (Load), then with Spark these all happen in memory, with
DataFrames feeding each other.
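
As a minimal sketch of the fixture idea, the code below prepares a small
test database once per session and closes it when the session ends. It uses
Python's built-in sqlite3, in memory, purely as a stand-in for the MySQL
test database. The table name sample, its columns and the helper
make_sample_db are hypothetical and are not taken from the setup described
here.

```python
import sqlite3

import pytest


def make_sample_db(n_rows=200):
    """Create an in-memory SQLite database holding n_rows of sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(
        "INSERT INTO sample (id, amount) VALUES (?, ?)",
        [(i, i * 1.5) for i in range(1, n_rows + 1)],
    )
    conn.commit()
    return conn


@pytest.fixture(scope="session")
def sample_db():
    # Everything before the yield is setup; everything after it is teardown.
    conn = make_sample_db()
    yield conn
    conn.close()


def test_sample_has_rows(sample_db):
    # Pytest passes in the connection that the fixture yielded.
    (count,) = sample_db.execute("SELECT COUNT(*) FROM sample").fetchone()
    assert count == 200
```

Keeping the data preparation in a plain function and wrapping it in a thin
fixture means the preparation step can also be called outside Pytest.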

The crucial thing to remember is that a fixture receives another fixture by
naming it as a parameter, not by invoking it directly! Pytest runs the named
fixture first and passes in its return value (here a DataFrame).

Example

    ## This is correct
    @pytest.fixture(scope = "session")
    def transformData(readSourceData):  ## fixture passed as parameter
        # this is incorrect (cannot call a fixture in another fixture)
        read_df = readSourceData()

So this operation becomes

    transformation_df = readSourceData. \
        select( \
        ....
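
The same rule can be sketched as a complete chain that runs without Spark.
Plain lists of dicts stand in for DataFrames here, and the helpers
read_source_rows and keep_positive_amounts, along with the sample rows, are
hypothetical.

```python
import pytest


def read_source_rows():
    """Extract stand-in: returns rows as dicts instead of a Spark DataFrame."""
    return [
        {"id": 1, "amount": 10.0},
        {"id": 2, "amount": -5.0},
        {"id": 3, "amount": 7.5},
    ]


def keep_positive_amounts(rows):
    """Transform stand-in: keeps only rows with a positive amount."""
    return [row for row in rows if row["amount"] > 0]


@pytest.fixture(scope="session")
def readSourceData():
    return read_source_rows()


@pytest.fixture(scope="session")
def transformData(readSourceData):
    # readSourceData is already the data returned by the fixture above,
    # so it is used as a value; calling readSourceData() here would fail.
    return keep_positive_amounts(readSourceData)


def test_transformData(transformData):
    assert [row["id"] for row in transformData] == [1, 3]
```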

Say in PyCharm, under the tests package, you create a package "fixtures"
(just a name, nothing to do with "fixture") and in there you put the ETL
Python modules that prepare data for you. Example

    ### file --> saveData.py
    @pytest.fixture(scope = "session")
    def saveData(transformData):
        # Write to test target table
        try:
            transformData. \
                write. \
                format("jdbc"). \
                ....
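
One possible way to complete a fixture of this shape is sketched below. It
is not the actual saveData.py: the connection settings, the table name, the
"overwrite" mode and the helper write_to_jdbc are all assumptions made for
illustration. The fixture returns True or False so that a test can simply
do assert saveData.

```python
import pytest

# Hypothetical connection settings for a local MySQL test database.
JDBC_URL = "jdbc:mysql://localhost:3306/test"
JDBC_TABLE = "test.transformed_sample"
JDBC_USER = "test_user"
JDBC_PASSWORD = "change_me"
JDBC_DRIVER = "com.mysql.cj.jdbc.Driver"


def write_to_jdbc(df, url, table, user, password, driver):
    """Write df to a JDBC table; return True on success, False on failure."""
    try:
        (
            df.write.format("jdbc")
            .option("url", url)
            .option("dbtable", table)
            .option("user", user)
            .option("password", password)
            .option("driver", driver)
            .mode("overwrite")  # assumption: replace the test table each run
            .save()
        )
        return True
    except Exception as e:
        # Report the reason, then signal failure to the calling test.
        print(f"JDBC write to {table} failed: {e}")
        return False


@pytest.fixture(scope="session")
def saveData(transformData):
    # transformData is the DataFrame returned by the transformData fixture.
    return write_to_jdbc(
        transformData, JDBC_URL, JDBC_TABLE, JDBC_USER, JDBC_PASSWORD, JDBC_DRIVER
    )
```

Returning False hides the stack trace from Pytest, so letting the exception
propagate is a reasonable alternative if the failure reason matters more
than a simple truthy assertion.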


You then drive this test by creating a file called *conftest.py* under the
*tests* package. You make your fixtures available to the tests by importing
them in this file, as below

    import pytest
    from tests.fixtures.extractHiveData import extractHiveData
    from tests.fixtures.loadIntoMysqlTable import loadIntoMysqlTable
    from tests.fixtures.readSavedData import readSavedData
    from tests.fixtures.readSourceData import readSourceData
    from tests.fixtures.transformData import transformData
    from tests.fixtures.saveData import saveData

Then you have your test Python file, say *test_oracle.py*, under the tests
package, and you put the assertions there

    import pytest
    from src.config import ctest

    @pytest.mark.usefixtures("extractHiveData")
    def test_extract(extractHiveData):
        assert extractHiveData.count() > 0

    @pytest.mark.usefixtures("loadIntoMysqlTable")
    def test_loadIntoMysqlTable(loadIntoMysqlTable):
        assert loadIntoMysqlTable

    @pytest.mark.usefixtures("readSourceData")
    def test_readSourceData(readSourceData):
        assert readSourceData.count() == ctest['statics']['read_df_rows']

    @pytest.mark.usefixtures("transformData")
    def test_transformData(transformData):
        assert transformData.count() == ctest['statics']['transformation_df_rows']

    @pytest.mark.usefixtures("saveData")
    def test_saveData(saveData):
        assert saveData

    @pytest.mark.usefixtures("readSavedData")
    def test_readSavedData(transformData, readSavedData):
        assert readSavedData.subtract(transformData).count() == 0

This is an illustration from PyCharm of the directory structure under tests


[image: image.png]


Let me know your thoughts.


Cheers,


Mich


LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
