Hi,

Which of the following is the better approach when the database holds a very large number of rows?
final Dataset<Row> dataset = spark.sqlContext().read()
        .format("jdbc")
        .option("url", params.getJdbcUrl())
        .option("driver", params.getDriver())
        .option("dbtable", params.getSqlQuery())
        // .option("partitionColumn", hashFunction)
        // .option("lowerBound", 0)
        // .option("upperBound", 10)
        // .option("numPartitions", 10)
        // .option("oracle.jdbc.timezoneAsRegion", "false")
        .option("fetchSize", 100000)
        .load();

dataset.write().parquet(params.getPath());

// The goal is to get the count of persisted rows.

// Approach 1: get the count directly from the dataset.
// As I understand it, this count will be translated to a JDBCRDD count
// and could be executed on the database.
long count = dataset.count();

// Approach 2: read back the saved Parquet files and get the count from them.
long count = spark.read().parquet(params.getPath()).count();

Regards,
Rohit