All,

This is very surprising and I suspect I am doing something wrong. The
issue is that the following code takes 8 hours. It reads a CSV file, takes
the phone number column, extracts the first five digits, and then
partitions on that prefix (phoneseries) and writes to Parquet.
Any clue why? The CSV file is only one million rows. Thanks in
advance. Spark version is 3.0.1.

val df1 = spark.read.format("csv").option("header", "true").load("file:///sparkcode/myjobs/csvs/*.csv")

// The CSV above contains a column named phonenumbercolumn with a very wide
// range of values; the total number of rows is just below one million.
// Also, the test was done with just one file.


val df = df1.withColumn("phoneseries", df1("phonenumbercolumn").substr(1, 5))


df.printSchema()  // Schema is printed correctly


df.write.option("header", "true").partitionBy("phoneseries").mode("append").parquet("file:///sparkcode/mydest/parquet")
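In case it is relevant: since partitionBy creates one output directory per
distinct value, a quick sketch like the one below (untested, using the same
column name as above) would show how many partition directories this write
is producing.

// Sketch: count the distinct phoneseries values to see how many
// partition directories partitionBy("phoneseries") will create.
val partitionCount = df.select("phoneseries").distinct().count()
println(s"Distinct phoneseries values: $partitionCount")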

Best,
Ravi
