If you have to get the data into Parquet format for other reasons anyway, then calling count() on the Parquet output should be the better option. If it is just the count you need, pushing the aggregation down to the database by setting dbtable to a subquery such as (select count(*) from <tablename>) should be quicker, since it avoids transferring all the rows from the database to Spark only to count them.
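A minimal sketch of that push-down approach, reusing the spark session and the params accessors exactly as they appear in the quoted code below; <tablename> is a placeholder, and note that most databases require the subquery passed as dbtable to carry an alias (here "t"):

```java
// Assumes a SparkSession "spark" and a "params" object with the same
// accessors as in the quoted code below (getJdbcUrl, getDriver).
final Dataset<Row> countDf = spark.read()
    .format("jdbc")
    .option("url", params.getJdbcUrl())
    .option("driver", params.getDriver())
    // The count(*) runs inside the database; only a single row
    // crosses the wire to Spark. The alias "t" is required by
    // most JDBC dialects for a derived table.
    .option("dbtable", "(select count(*) as cnt from <tablename>) t")
    .load();

// count(*) may come back as BIGINT, NUMBER, etc. depending on the
// database, so go through Number rather than assuming a long column.
long count = ((Number) countDf.first().get(0)).longValue();
```

This only helps when the count is all you need; if you are writing the Parquet files regardless, counting the written dataset costs you no extra transfer from the database.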
Hope that helps
-suresh

> On Feb 8, 2017, at 2:58 AM, Rohit Verma <rohit.ve...@rokittech.com> wrote:
>
> Hi,
>
> Which of the following is the better approach when there are too many values in the database?
>
>     final Dataset<Row> dataset = spark.sqlContext().read()
>         .format("jdbc")
>         .option("url", params.getJdbcUrl())
>         .option("driver", params.getDriver())
>         .option("dbtable", params.getSqlQuery())
>         // .option("partitionColumn", hashFunction)
>         // .option("lowerBound", 0)
>         // .option("upperBound", 10)
>         // .option("numPartitions", 10)
>         // .option("oracle.jdbc.timezoneAsRegion", "false")
>         .option("fetchSize", 100000)
>         .load();
>     dataset.write().parquet(params.getPath());
>
>     // target is to get count of persisted rows.
>
>     // approach 1, i.e. getting count directly from dataset
>     // as I understood, this count will be translated to jdbcRdd.count and
>     // could run on the database
>     long count = dataset.count();
>
>     // approach 2, i.e. read back saved parquet and get count from it.
>     long count = spark.read().parquet(params.getPath()).count();
>
> Regards
> Rohit