Hello Community,
I am working in PySpark with Spark SQL and have a fairly complex set of DataFrames that I will have to build several times, once for each of the "models" I have.
The code is exactly the same for all models; only the table it reads from and some values in the WHERE clauses contain the model name.
My question is how to prevent repetitive code.
So instead of doing something like this (this is pseudocode, in reality it makes
use of lots of complex DataFrames), which would also require me to change the
code in several places every time I change it in the future:

dfmodel1 = sqlContext.sql("SELECT <quite complex query> FROM model1_table WHERE model = 'model1'").write.save()
dfmodel2 = sqlContext.sql("SELECT <quite complex query> FROM model2_table WHERE model = 'model2'").write.save()
dfmodel3 = sqlContext.sql("SELECT <quite complex query> FROM model3_table WHERE model = 'model3'").write.save()


For loops in Spark sound like a bad idea, but that is mainly when looping over
the data itself; a plain Python loop over SQL statements runs only on the
driver, so maybe there is nothing against it. Is it allowed to do something
like this?


spark-submit withloops.py model1 model2 model3

code withloops.py:

import sys

models = sys.argv[1:]  # e.g. ["model1", "model2", "model3"]
qry = """SELECT <quite complex query> FROM {} WHERE model = '{}'"""
for model in models:
    from_table = "{}_table".format(model)  # e.g. model1_table for model1
    sqlContext.sql(qry.format(from_table, model)).write.save()
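
For what it's worth, one common way to avoid the repetition is to wrap the shared query in a small function and loop over the model names on the driver; the loop only submits one Spark job per model, it never iterates over rows. Below is a minimal sketch of that idea, assuming the modern SparkSession entry point; the process_model helper, the <model>_table naming convention, and the /output/<model> Parquet path are placeholders I made up for illustration:

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("withloops").getOrCreate()

def process_model(model_name):
    # Hypothetical helper: runs the shared query for one model.
    # Assumes the <model>_table naming convention from the example above.
    table = "{}_table".format(model_name)
    qry = "SELECT <quite complex query> FROM {} WHERE model = '{}'".format(
        table, model_name)
    df = spark.sql(qry)
    # Placeholder sink; substitute your real write target and mode.
    df.write.mode("overwrite").parquet("/output/{}".format(model_name))

for model in sys.argv[1:]:
    process_model(model)

Since the model name is interpolated straight into the SQL string, it should come from a trusted list; you could also avoid string formatting entirely by filtering with the DataFrame API, e.g. spark.table(table).where(col("model") == model_name) with col imported from pyspark.sql.functions.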



I was trying to look up information about refactoring in PySpark to prevent
redundant code but didn't find any relevant links.



Thanks for input!
