Hi. Are groupBy and partition similar in this scenario? I know they are not the same and serve different purposes, but I am confused here. Do I still need to do partitioning here to save into Cassandra?
Below is my scenario. I am using spark-sql-2.4.1, spark-cassandra-connector_2.11-2.4.1 with Java 8, and Apache Cassandra 3.0.

My spark-submit / Spark cluster environment is set up as below to load *2 billion records*:

```
--executor-cores 3 --executor-memory 9g --num-executors 5 --driver-cores 2 --driver-memory 4g
```

I am loading into Cassandra tables using a Spark DataFrame. After reading into a Spark Dataset, I group by certain columns as below:

```java
Dataset<Row> dataDf = // read data from source.

Dataset<Row> groupedDf = dataDf.groupBy("id", "type", "value", "load_date",
        "fiscal_year", "fiscal_quarter", "create_user_txt", "create_date");

groupedDf.write().format("org.apache.spark.sql.cassandra")
        .option("table", "product")
        .option("keyspace", "dataload")
        .mode(SaveMode.Append)
        .save();
```

Cassandra table:

```
PRIMARY KEY (( id, type, value, item_code ), load_date)
WITH CLUSTERING ORDER BY ( load_date DESC )
```

Basically I group by the "id", "type", "value", "load_date" columns. Since the other columns ("fiscal_year", "fiscal_quarter", "create_user_txt", "create_date") need to be available for storing into the Cassandra table, I had to include them in the groupBy clause as well.

1) Frankly speaking, I don't know how to get those columns into the resultant dataframe (i.e. groupedDf) after the groupBy so that I can store them. Any advice on how to tackle this, please?

2) With the above process/steps, my Spark load job is pretty slow due to a lot of shuffling, i.e. read shuffle and write shuffle. What should I do here to improve the speed? While reading from the source (into dataDf), do I need to do anything to improve performance? Are groupBy and partition similar? Do I still need to do any partitioning? If so, what is the best way/approach given the above Cassandra table?

Please advise.

thanks,
Shyam

https://stackoverflow.com/questions/57684972/is-groupby-and-partition-are-similar-how-to-improve-performance-my-spark-job-h
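For context on question 1: a common Spark pattern for keeping columns that are not part of the grouping key is to carry them through the aggregation, e.g. with `first()`. A minimal sketch follows; the in-memory rows, schema, and class name are illustrative stand-ins for the real source, not the actual pipeline:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.first;

public class GroupByCarrySketch {
    public static void main(String[] args) {
        // Local session purely for illustration; the real job runs on a cluster.
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("groupby-carry-sketch")
                .getOrCreate();

        // Hypothetical stand-in for the real source data.
        StructType schema = new StructType()
                .add("id", DataTypes.StringType)
                .add("type", DataTypes.StringType)
                .add("value", DataTypes.StringType)
                .add("load_date", DataTypes.StringType)
                .add("fiscal_year", DataTypes.IntegerType)
                .add("create_user_txt", DataTypes.StringType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("1", "A", "x", "2019-08-01", 2019, "alice"),
                RowFactory.create("1", "A", "x", "2019-08-01", 2019, "alice"),
                RowFactory.create("2", "B", "y", "2019-08-01", 2019, "bob"));
        Dataset<Row> dataDf = spark.createDataFrame(rows, schema);

        // Group only by the columns that identify a row in the Cassandra table,
        // and carry the remaining columns through the aggregation with first().
        Dataset<Row> groupedDf = dataDf
                .groupBy("id", "type", "value", "load_date")
                .agg(first("fiscal_year").as("fiscal_year"),
                     first("create_user_txt").as("create_user_txt"));

        groupedDf.show();
        spark.stop();
    }
}
```

The result keeps one row per grouping key while still exposing the non-key columns, so it can be passed to the Cassandra writer unchanged. Note that `first()` picks an arbitrary value per group; whether that is acceptable depends on whether duplicates within a group really carry identical values.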
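For context on question 2: one commonly suggested mitigation for write-side shuffle pressure is to repartition the DataFrame by the Cassandra partition-key columns before writing, so rows destined for the same Cassandra partition land in the same Spark task. A minimal sketch, with a synthetic DataFrame and an arbitrary partition count standing in for the real data and tuning:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

public class RepartitionBeforeWriteSketch {
    public static void main(String[] args) {
        // Local session purely for illustration.
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("repartition-sketch")
                .getOrCreate();

        // Hypothetical stand-in for the grouped DataFrame from the question.
        Dataset<Row> groupedDf = spark.range(0, 100)
                .withColumn("id", col("id").cast("string"))
                .withColumn("type", lit("A"));

        // Co-locate rows that share the Cassandra partition-key columns so
        // each Spark task writes to fewer distinct Cassandra partitions.
        // 12 is an arbitrary illustrative partition count, not a recommendation.
        Dataset<Row> repartitioned =
                groupedDf.repartition(12, col("id"), col("type"));

        // The actual write would then proceed as in the question:
        // repartitioned.write().format("org.apache.spark.sql.cassandra")...
        System.out.println(repartitioned.rdd().getNumPartitions());
        spark.stop();
    }
}
```

This does add one explicit shuffle of its own, so it only helps when the savings on the write side outweigh that cost; it is a sketch of the technique, not a guaranteed speed-up.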