Hi,
I have a DataFrame with 1000 columns that I want to encode with StringIndexer.
When I apply the pipeline, it takes a long time to merge the result with the
original DataFrame.

I mean:
 original DataFrame + the columns indexed by the StringIndexers

Problem: the save stage is very slow. Why?

Code:

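     from pyspark.ml import Pipeline
     from pyspark.ml.feature import StringIndexer

     # l is the list of column names to index
     indexers = [StringIndexer(inputCol=i, outputCol=i + "_index").fit(df)
                 for i in l]
     li = [i + "_index" for i in l]  # names of the new index columns
     pipeline = Pipeline(stages=indexers)
     df_r = pipeline.fit(df).transform(df)
     df_r = df_r.repartition(500)
     df_r.persist()
     # in PySpark, write is a property, not a method
     df_r.write.parquet(paths)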
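
For comparison, here is a minimal sketch of a variant that may be worth
trying. It assumes Spark 3.0 or later, where StringIndexer accepts the
inputCols and outputCols parameters, so a single indexer can cover every
column instead of building 1000 separate stages. The helper names
index_column_names and index_all_columns are hypothetical and only used for
illustration. Whether this actually shortens the save stage depends on the
data and the cluster, so treat it as something to measure, not a guaranteed
fix.

```python
def index_column_names(cols, suffix="_index"):
    """Return the output column name for each input column."""
    return [c + suffix for c in cols]


def index_all_columns(df, cols):
    """Index every column in `cols` with one multi-column StringIndexer.

    Hypothetical helper; assumes Spark 3.0+ (inputCols / outputCols).
    Returns the original columns plus one "<name>_index" column per input.
    """
    # Imported here so the naming helper above works without Spark installed.
    from pyspark.ml.feature import StringIndexer

    indexer = StringIndexer(inputCols=cols,
                            outputCols=index_column_names(cols))
    # A single fit call covers all columns.
    return indexer.fit(df).transform(df)
```

With the names from the code above, this would be called as
df_r = index_all_columns(df, l) before writing the result to parquet.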
