Hi,
Are groupBy and partitioning similar in this scenario?
I know they are not the same and are meant for different purposes, but I am
confused here.
Do I still need to do partitioning here before saving into Cassandra?

Below is my scenario.

I am using spark-sql-2.4.1 and spark-cassandra-connector_2.11-2.4.1 with
Java 8 and Apache Cassandra 3.0.

My spark-submit (Spark cluster environment) settings for loading *2
billion records* are as below.

--executor-cores 3
--executor-memory 9g
--num-executors 5
--driver-cores 2
--driver-memory 4g

I am loading the data into Cassandra tables using a Spark DataFrame. After
reading it into a Spark Dataset, I group by certain columns as below.

Dataset<Row> dataDf = // read data from source

Dataset<Row> groupedDf = dataDf.groupBy("id", "type", "value",
    "load_date", "fiscal_year", "fiscal_quarter", "create_user_txt",
    "create_date")

groupedDf.write().format("org.apache.spark.sql.cassandra")
    .option("table","product")
    .option("keyspace", "dataload")
    .mode(SaveMode.Append)
    .save();

Cassandra table(
    PRIMARY KEY (( id, type, value, item_code ), load_date)
) WITH CLUSTERING ORDER BY ( load_date DESC )

Basically, I am grouping by the "id", "type", "value" and "load_date"
columns. Because the other columns ("fiscal_year", "fiscal_quarter",
"create_user_txt", "create_date") must also be available for storing into
the Cassandra table, I have to include them in the groupBy clause as well.

1) Frankly speaking, I don't know how to get those columns into the
resultant dataframe (i.e. groupedDf) after the groupBy so that they can be
stored. Any advice on how to tackle this, please?
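
To illustrate the result I am after, here is a minimal plain-Java sketch
(no Spark involved). It groups rows by a few key columns and keeps the
first-seen value of every other column. The class name GroupKeepColumns,
the method groupKeepFirst, the row helper and the sample values are all
made up for illustration. In Spark, an aggregate such as first(...) inside
agg(...) after the groupBy may be the analogous step, but that is an
assumption and is not tested here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupKeepColumns {

    // Collapse rows that share the same values in keyCols, keeping the
    // first-seen value of every other column (a "first" style aggregate).
    public static List<Map<String, String>> groupKeepFirst(
            List<Map<String, String>> rows, List<String> keyCols) {
        Map<List<String>, Map<String, String>> groups = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            List<String> key = new ArrayList<>();
            for (String col : keyCols) {
                key.add(row.get(col));
            }
            // putIfAbsent keeps the first row seen for each key.
            groups.putIfAbsent(key, new LinkedHashMap<>(row));
        }
        return new ArrayList<>(groups.values());
    }

    // Hypothetical helper: builds a row from alternating column/value pairs.
    public static Map<String, String> row(String... kv) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        // Made-up sample rows; the first two share the same key.
        List<Map<String, String>> rows = Arrays.asList(
            row("id", "1", "type", "A", "fiscal_year", "2019"),
            row("id", "1", "type", "A", "fiscal_year", "2019"),
            row("id", "2", "type", "B", "fiscal_year", "2018"));
        List<Map<String, String>> grouped =
            groupKeepFirst(rows, Arrays.asList("id", "type"));
        for (Map<String, String> r : grouped) {
            System.out.println(r);
        }
    }
}
```

Here the non-key column fiscal_year survives the grouping without being
part of the grouping key, which is what I would like groupedDf to look like.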

2) With the above steps, my Spark load job is pretty slow due to a lot of
shuffling, i.e. shuffle read and shuffle write.

What should I do to improve the speed?

While reading from the source (into dataDf), do I need to do anything to
improve performance?

Are groupBy and partitioning similar? Do I still need to do any
partitioning? If so, what is the best approach given the above
Cassandra table?

Please advise.

thanks,
Shyam


https://stackoverflow.com/questions/57684972/is-groupby-and-partition-are-similar-how-to-improve-performance-my-spark-job-h
