I was a bit confused by the use of fixtures in pytest when dataframes are passed as an input pipeline from one fixture to another. I wrote this after spending some time on it. As usual, it is heuristic rather than anything strictly by the book, so to speak.
With PySpark in PyCharm you can ETL from Hive to BigQuery, or from Oracle to Hive, etc. However, for pytest, I decided to use MySQL as the database of choice for testing, with a small sample of data (200 rows).

I mentioned fixtures. Simply put, "Fixtures are functions, which will run before each test function to which it is applied, to prepare data. Fixtures are used to feed some data to the tests such as database connections".

If you have an ordering like read data (Extract), do something with it (Transform) and save it somewhere (Load) using Spark, then these all happen in memory, with data frames feeding each other. The crucial thing to remember is that fixtures are passed to each other as parameters, not invoked directly! pytest resolves the parameter name and injects the value the fixture returned (here, a data frame), so there is nothing to call.

Example

    ## This is correct
    @pytest.fixture(scope = "session")
    def transformData(readSourceData):  ## fixture passed as parameter
        # this is incorrect (cannot call a fixture in another fixture)
        read_df = readSourceData()

So this operation becomes

        transformation_df = readSourceData. \
            select( \
            ....

Say in PyCharm, under the tests package, you create a package "fixtures" (just a name, nothing to do with "fixture"), and in there you put your ETL Python modules that prepare data for you.

Example

    ### file --> saveData.py
    @pytest.fixture(scope = "session")
    def saveData(transformData):
        # Write to test target table
        try:
            transformData. \
                write. \
                format("jdbc"). \
                ....

You then drive this test by creating a file called conftest.py under the tests package.
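To make the chaining rule concrete, here is a minimal sketch that runs without Spark or a database. It is not the pipeline described in this email: plain Python lists stand in for the data frames, and the names build_rows, keep_large, read_source_data and transform_data are hypothetical, made up for illustration. The only point it shows is that the second fixture receives the value returned by the first one, simply by naming it as a parameter.

```python
import pytest


def build_rows():
    """Hypothetical stand-in for reading a source table (a list, not a Spark DataFrame)."""
    return [{"id": i, "amount": i * 10} for i in range(1, 6)]


def keep_large(rows, threshold=30):
    """Hypothetical transformation: keep rows whose amount is at least `threshold`."""
    return [row for row in rows if row["amount"] >= threshold]


@pytest.fixture(scope="session")
def read_source_data():
    # Extract step: built once per test session because of scope="session".
    return build_rows()


@pytest.fixture(scope="session")
def transform_data(read_source_data):
    # read_source_data is already the list returned by the fixture above;
    # pytest injected it, so it is used directly and never called.
    return keep_large(read_source_data)


def test_read_source_data(read_source_data):
    assert len(read_source_data) == 5


def test_transform_data(read_source_data, transform_data):
    # Amounts are 10..50, so three rows survive the threshold of 30.
    assert len(transform_data) == 3
    assert all(row in read_source_data for row in transform_data)
```

Saved under any test_*.py name, this should run with a plain "pytest -q". With a real Spark job, the two helper functions would be replaced by the JDBC read and the select shown earlier, while the fixture wiring stays the same.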
You can then make your fixture modules available to the tests by importing them in conftest.py, as below

    import pytest
    from tests.fixtures.extractHiveData import extractHiveData
    from tests.fixtures.loadIntoMysqlTable import loadIntoMysqlTable
    from tests.fixtures.readSourceData import readSourceData
    from tests.fixtures.transformData import transformData
    from tests.fixtures.saveData import saveData
    from tests.fixtures.readSavedData import readSavedData

Then you have your test Python file, say test_oracle.py, under the tests package, and you put your assertions there

    import pytest
    from src.config import ctest

    @pytest.mark.usefixtures("extractHiveData")
    def test_extract(extractHiveData):
        assert extractHiveData.count() > 0

    @pytest.mark.usefixtures("loadIntoMysqlTable")
    def test_loadIntoMysqlTable(loadIntoMysqlTable):
        assert loadIntoMysqlTable

    @pytest.mark.usefixtures("readSourceData")
    def test_readSourceData(readSourceData):
        assert readSourceData.count() == ctest['statics']['read_df_rows']

    @pytest.mark.usefixtures("transformData")
    def test_transformData(transformData):
        assert transformData.count() == ctest['statics']['transformation_df_rows']

    @pytest.mark.usefixtures("saveData")
    def test_saveData(saveData):
        assert saveData

    @pytest.mark.usefixtures("readSavedData")
    def test_readSavedData(transformData, readSavedData):
        # rows in the saved table that are missing from the transformed data frame
        assert readSavedData.subtract(transformData).count() == 0

This is an illustration from PyCharm of the directory structure under tests

[image: image.png]

Let me know your thoughts.

Cheers,

Mich

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.