All, This is very surprising and I am sure I must be doing something wrong. The issue is that the following code takes 8 hours to run. It reads a CSV file, takes the phone number column, extracts the first five characters, partitions on that prefix (phoneseries), and writes the result to Parquet. Any clue why? The CSV file has just under one million rows. Thanks in advance. Spark version is 3.0.1.
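For reference, Spark's Column.substr(startPos, len) is 1-indexed, so substr(1, 5) extracts the first five characters of the value. A minimal plain-Scala sketch of the same prefix extraction (the sample phone number below is made up):

```scala
// Column.substr(1, 5) in Spark takes five characters starting at
// 1-indexed position 1. The plain-String equivalent uses the
// 0-indexed substring(begin, end):
val phone  = "9876543210"           // hypothetical phone number
val prefix = phone.substring(0, 5)  // first five characters
println(prefix)                     // prints "98765"
```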
val df1 = spark.read.format("csv").option("header", "true").load("file:///sparkcode/myjobs/csvs/*.csv")
// The CSV contains a column named phonenumbercolumn with a very wide range of values;
// the total number of rows is just under one million. The test was done with just one file.
val df = df1.withColumn("phoneseries", df1("phonenumbercolumn").substr(1, 5))
df.printSchema() // Schema is printed correctly
df.write.option("header", "true").partitionBy("phoneseries").mode("append").parquet("file:///sparkcode/mydest/parquet")

Best,
Ravi