If you have to get the data into Parquet format for other reasons anyway, then calling count() on the Parquet output should be the better option. If it is just the count you need, pushing the aggregation down to the database by setting dbtable to a subquery such as (select count(*) from <tablename>) should be quicker, since it avoids transferring all the rows from the database to Spark only to count them.
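A minimal sketch of that push-down approach, reusing the spark session and the params accessors exactly as they appear in the quoted code below; <tablename> is a placeholder, and note that most databases require the subquery passed as dbtable to carry an alias (here "t"):

```java
// Assumes a SparkSession "spark" and a "params" object with the same
// accessors as in the quoted code below (getJdbcUrl, getDriver).
final Dataset<Row> countDf = spark.read()
    .format("jdbc")
    .option("url", params.getJdbcUrl())
    .option("driver", params.getDriver())
    // The count(*) runs inside the database; only a single row
    // crosses the wire to Spark. The alias "t" is required by
    // most JDBC dialects for a derived table.
    .option("dbtable", "(select count(*) as cnt from <tablename>) t")
    .load();

// count(*) may come back as BIGINT, NUMBER, etc. depending on the
// database, so go through Number rather than assuming a long column.
long count = ((Number) countDf.first().get(0)).longValue();
```

This only helps when the count is all you need; if you are writing the Parquet files regardless, counting the written dataset costs you no extra transfer from the database.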
Hope that helps
-suresh

> On Feb 8, 2017, at 2:58 AM, Rohit Verma <rohit.ve...@rokittech.com> wrote:
>
> Hi,
>
> Which of the following is the better approach when there are too many values in the database?
>
>     final Dataset<Row> dataset = spark.sqlContext().read()
>         .format("jdbc")
>         .option("url", params.getJdbcUrl())
>         .option("driver", params.getDriver())
>         .option("dbtable", params.getSqlQuery())
>         // .option("partitionColumn", hashFunction)
>         // .option("lowerBound", 0)
>         // .option("upperBound", 10)
>         // .option("numPartitions", 10)
>         // .option("oracle.jdbc.timezoneAsRegion", "false")
>         .option("fetchSize", 100000)
>         .load();
>     dataset.write().parquet(params.getPath());
>
>     // target is to get count of persisted rows.
>
>     // approach 1, i.e. getting count directly from dataset
>     // as I understood, this count will be translated to jdbcRdd.count and
>     // could run on the database
>     long count = dataset.count();
>
>     // approach 2, i.e. read back saved parquet and get count from it.
>     long count = spark.read().parquet(params.getPath()).count();
>
> Regards
> Rohit